Adding OCR information to a PDF

Question

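I have good-quality scans of some documents; the scans are in PDF format.

How can I add OCR information to the PDF so that it becomes searchable? By searchable I mean that when I view the PDF in evince, Ctrl+F actually lets me search the content of the PDF.

Best answer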

pdfsandwich

It does what you want and provides Ubuntu deb packages. It uses tesseract as its OCR engine. The following call adds a text layer to a scanned PDF:

pdfsandwich scanned.pdf
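
To confirm that the text layer really was added, one option is to extract the text and grep it. The sketch below is not part of pdfsandwich: it assumes pdftotext (from poppler-utils) is installed, the helper name pdf_contains is hypothetical, and scanned_ocr.pdf is only the name pdfsandwich is assumed to write by default.

```shell
# Report whether a PDF contains extractable text matching a search term.
# Assumption: pdftotext (poppler-utils) is available on the PATH.
# Usage: pdf_contains FILE.pdf TERM
pdf_contains() {
    # "-" sends the extracted text to stdout; grep -q only sets the exit status,
    # so the function succeeds exactly when the term is found (case-insensitive).
    pdftotext "$1" - 2>/dev/null | grep -qi -- "$2"
}

# Example (output file name is an assumption, adjust to your actual file):
# pdf_contains scanned_ocr.pdf "invoice" && echo "searchable"
```

If this succeeds for a word you can see on the page, Ctrl+F in evince should be able to find it as well.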

The following does the same, but uses another language (ISO 639-2 code; install the matching tesseract-ocr-LANGCODE package) and sets the layout:

pdfsandwich  -verbose -lang spa -layout single scanned.pdf
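
The language package has to be installed before -lang can use it. A minimal sketch of that check follows; the helper names tesseract_pkg and has_tesseract_lang are hypothetical, and it assumes an apt-based system where packages follow the tesseract-ocr-LANGCODE naming mentioned above.

```shell
# Build the Ubuntu package name for a tesseract language code (e.g. spa).
tesseract_pkg() {
    printf 'tesseract-ocr-%s\n' "$1"
}

# Succeed if tesseract already lists the given language as installed.
# 2>&1 is used because some tesseract versions print the list on stderr.
has_tesseract_lang() {
    tesseract --list-langs 2>&1 | grep -qx -- "$1"
}

# Example: install Spanish only when it is missing.
# has_tesseract_lang spa || sudo apt-get install "$(tesseract_pkg spa)"
```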

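If you run into any bugs, please download the latest version of the deb from SourceForge.

Disclaimer: I am the developer of pdfsandwich, so I am obviously biased.

Second-best answer

Two projects can solve this problem: GScan2PDF and OCRFeeder.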

Third answer

OCRmyPDF is a solution that is easy to set up and produces an output PDF of the same quality as the input file, at a reasonable size:

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

ocrmypdf                      # it's a scriptable command line program
   -l eng+fra                 # it supports multiple languages
   --rotate-pages             # it can fix pages that are misrotated
   --deskew                   # it can deskew crooked PDFs!
   --title "My PDF"           # it can change output metadata
   --jobs 4                   # it uses multiple cores by default
   --output-type pdfa         # it produces PDF/A by default
   input_scanned.pdf          # takes PDF input (or images)
   output_searchable.pdf      # produces validated PDF output
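
The annotated listing is meant for reading; to run it, the options go on one command line. As a rough sketch of how that could be applied to a whole directory of scans, the functions below wrap ocrmypdf with a subset of the options shown (-l, --rotate-pages, --deskew). The function names and the _ocr.pdf output suffix are assumptions made for illustration, not OCRmyPDF conventions.

```shell
# Derive a hypothetical output name: report.pdf -> report_ocr.pdf
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# Run ocrmypdf on every PDF in a directory, skipping files that this
# function already produced on an earlier run.
# Usage: ocr_directory DIR
ocr_directory() {
    for pdf in "$1"/*.pdf; do
        [ -e "$pdf" ] || continue                    # no PDFs: the glob stays literal
        case "$pdf" in *_ocr.pdf) continue ;; esac   # skip earlier outputs
        ocrmypdf -l eng+fra --rotate-pages --deskew \
            "$pdf" "$(ocr_output_name "$pdf")"
    done
}

# Example:
# ocr_directory ~/scans
```

The eng+fra setting requires both tesseract language packs to be installed; replace it with the languages your documents actually use.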

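Fourth answer

I found a solution that is not ideal but works very well.

I use PDF X-Change Viewer through Wine. It has an OCR function that adds a text layer to an existing image-based PDF.

You can then search and copy text from this invisible layer.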
