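How can I extract a page range / a part of a PDF?

Question

Do you have any idea how to extract a part of a PDF document and save it as a PDF? On OS X this is absolutely trivial using Preview. I tried PDF editor and other programs, but to no avail.

I would like a program where I select the part I want and then save it as a PDF with a simple command like CMD + N on OS X. I want the extracted part to be saved in PDF format, not as jpeg etc.

Best answer

pdftk is a useful multi-platform tool for the job (pdftk homepage).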

pdftk full-pdf.pdf cat 12-15 output outfile_p12-15.pdf

You pass the filename of the main PDF, then tell it to include only certain pages (12-15 in this example) and to write them to a new file.
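
The same call can be wrapped in a small helper. This is a minimal sketch, assuming pdftk is on the PATH; the function names range_outfile and extract_range are made up for illustration, and the output name simply mirrors the outfile_p12-15.pdf pattern used above.

```shell
#!/bin/bash
# Sketch only: range_outfile and extract_range are hypothetical helper names.

# Build an output name such as "full-pdf_p12-15.pdf".
# $1 = input file, $2 = first page, $3 = last page
range_outfile() {
  printf '%s_p%s-%s.pdf\n' "${1%.pdf}" "$2" "$3"
}

# Copy pages $2..$3 of the PDF $1 into a new file next to it.
extract_range() {
  pdftk "$1" cat "$2-$3" output "$(range_outfile "$1" "$2" "$3")"
}

# Usage (requires pdftk):
#   extract_range full-pdf.pdf 12 15
```

Quoting the arguments keeps the helper usable with file names that contain spaces.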

Second answer

Very simple, using the default PDF reader:

Print to File. That's it!

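Third answer

Page range – Nautilus script

Overview

I created a slightly more advanced script based on the tutorial @ThiagoPonte linked to. Its main features are

that it is GUI based,
compatible with spaces in file names,
and based on three different backends that are capable of preserving all attributes of the original file.

Code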

#!/bin/bash
#
# TITLE:        PDFextract
#
# AUTHOR:       (c) 2013-2015 Glutanimate (https://github.com/Glutanimate)
#
# VERSION:      0.2
#
# LICENSE:      GNU GPL v3 (http://www.gnu.org/licenses/gpl.html)
# 
# OVERVIEW:     PDFextract is a simple PDF extraction script based on Ghostscript/qpdf/cpdf.
#               It provides a simple way to extract a page range from a PDF document and is meant
#               to be used as a file manager script/addon (e.g. Nautilus script).
#
# FEATURES:     - simple GUI based on YAD, an advanced Zenity fork.
#               - preserves _all_ attributes of your original PDF file and does not compress 
#                 embedded images further than they are.      
#               - can choose from three different backends: ghostscript, qpdf, cpdf
#
# DEPENDENCIES: ghostscript/qpdf/cpdf poppler-utils yad libnotify-bin
#                         
#               You need to install at least one of the three backends supported by this script.
#
#               - ghostscript, qpdf, poppler-utils, and libnotify-bin are available via 
#                 the standard Ubuntu repositories
#               - cpdf is a commercial CLI PDF toolkit that is free for personal use.
#                 It can be downloaded here: https://github.com/coherentgraphics/cpdf-binaries
#               - yad can be installed from the webupd8 PPA with the following command:
#                 sudo add-apt-repository ppa:webupd8team/y-ppa-manager && sudo apt-get update && sudo apt-get install yad
#
# NOTES:        Here is a quick comparison of the advantages and disadvantages of each backend:
#
#                               speed     metadata preservation     content preservation        license
#               ghostscript:     --               ++                         ++               open-source
#               cpdf:             -               ++                         ++               proprietary
#               qpdf:            ++                +                         ++               open-source
#
#               Results might vary depending on the document and the version of the tool in question.
#
# INSTALLATION: https://askubuntu.com/a/236415
#
# This script was inspired by Kurt Pfeifle's PDF extraction script 
# (http://www.linuxjournal.com/content/tech-tip-extract-pages-pdf)
#
# Originally posted on askubuntu
# (https://askubuntu.com/a/282453)

# Variables

DOCUMENT="$1"
BACKENDSELECTION="^qpdf!ghostscript!cpdf"

# Functions

check_input(){
  if [[ -z "$1" ]]; then
    notify "Error: No input file selected."
    exit 1
  elif [[ ! "$(file -ib "$1")" == *application/pdf* ]]; then
    notify "Error: Not a valid PDF file."
    exit 1
  fi
}

check_deps () {
  for i in "$@"; do
    type "$i" > /dev/null 2>&1 
    if [[ "$?" != "0" ]]; then
      MissingDeps+="$i"
    fi
  done
}

ghostscriptextract(){
  gs -dFirstPage="$STARTPAGE" -dLastPage="$STOPPAGE" -sOutputFile="$OUTFILE" -dSAFER -dNOPAUSE -dBATCH -dPDFSETTINGS=/default -sDEVICE=pdfwrite -dCompressFonts=true -c \
  ".setpdfwrite << /EncodeColorImages true /DownsampleMonoImages false /SubsetFonts true /ASCII85EncodePages false /DefaultRenderingIntent /Default /ColorConversionStrategy \
  /LeaveColorUnchanged /MonoImageDownsampleThreshold 1.5 /ColorACSImageDict << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.4 /Blend 1 >> /GrayACSImageDict \
  << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.4 /Blend 1 >> /PreserveOverprintSettings false /MonoImageResolution 300 /MonoImageFilter /FlateEncode \
  /GrayImageResolution 300 /LockDistillerParams false /EncodeGrayImages true /MaxSubsetPCT 100 /GrayImageDict << /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor \
  0.4 /Blend 1 >> /ColorImageFilter /FlateEncode /EmbedAllFonts true /UCRandBGInfo /Remove /AutoRotatePages /PageByPage /ColorImageResolution 300 /ColorImageDict << \
  /VSamples [ 1 1 1 1 ] /HSamples [ 1 1 1 1 ] /QFactor 0.4 /Blend 1 >> /CompatibilityLevel 1.7 /EncodeMonoImages true /GrayImageDownsampleThreshold 1.5 \
  /AutoFilterGrayImages false /GrayImageFilter /FlateEncode /DownsampleGrayImages false /AutoFilterColorImages false /DownsampleColorImages false /CompressPages true \
  /ColorImageDownsampleThreshold 1.5 /PreserveHalftoneInfo false >> setdistillerparams" -f "$DOCUMENT"
}

cpdfextract(){
  cpdf "$DOCUMENT" "$STARTPAGE-$STOPPAGE" -o "$OUTFILE"
}

qpdfextract(){
  qpdf --linearize "$DOCUMENT" --pages "$DOCUMENT" "$STARTPAGE-$STOPPAGE" -- "$OUTFILE"
  echo "$OUTFILE"
  return 0 # even benign qpdf warnings produce error codes, so we suppress them
}

notify(){
  echo "$1"
  notify-send -i application-pdf "PDFextract" "$1"
}

dialog_warning(){
  echo "$1"
  yad --center --image dialog-warning \
  --title "PDFExtract Warning" \
  --text "$1" \
  --button="Try again:0" \
  --button="Exit:1"

  [[ "$?" != "0" ]] && exit 0
}

dialog_settings(){
  PAGECOUNT=$(pdfinfo "$DOCUMENT" | grep Pages | sed 's/[^0-9]*//') #determine page count

  SETTINGS=($(\
      yad --form --width 300 --center \
          --window-icon application-pdf --image application-pdf \
          --separator=" " --title="PDFextract"\
          --text "Please choose the page range and backend"\
          --field="Start:NUM" 1[!1..$PAGECOUNT[!1]] --field="End:NUM" $PAGECOUNT[!1..$PAGECOUNT[!1]] \
          --field="Backend":CB "$BACKENDSELECTION" \
          --button="gtk-ok:0" --button="gtk-cancel:1"\
      ))

  SETTINGSRET="$?"

  [[ "$SETTINGSRET" != "0" ]] && exit 1

  STARTPAGE=$(printf %.0f ${SETTINGS[0]}) #round numbers and store array in variables
  STOPPAGE=$(printf %.0f ${SETTINGS[1]})
  BACKEND="${SETTINGS[2]}"
  EXTRACTOR="${BACKEND}extract"

  check_deps "$BACKEND"

  if [[ -n "$MissingDeps" ]]; then
    dialog_warning "Error, missing dependency: $MissingDeps"
    unset MissingDeps
    dialog_settings
    return
  fi

  if [[ "$STARTPAGE" -gt "$STOPPAGE" ]]; then 
    dialog_warning "<b>   Start page higher than stop page.   </b>"
    dialog_settings
    return
  fi

  OUTFILE="${DOCUMENT%.pdf} (p${STARTPAGE}-p${STOPPAGE}).pdf"
}

extract_pages(){
  $EXTRACTOR
  EXTRACTORRET="$?"
  if [[ "$EXTRACTORRET" = "0" ]]; then
    notify "Pages $STARTPAGE to $STOPPAGE successfully extracted."
  else
    notify "There has been an error. Please check the CLI output."
  fi
}


# Main

check_input "$1"
dialog_settings
extract_pages

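Installation

Please follow the generic installation instructions for Nautilus scripts. Make sure to read the script header carefully, as it helps to clarify the installation and usage of the script.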
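
In practice, installing a Nautilus script usually comes down to copying it into the scripts folder and making it executable. The sketch below assumes the folder is ~/.local/share/nautilus/scripts (older releases used ~/.gnome2/nautilus-scripts) and that the script was saved as PDFextract. The helper name install_nautilus_script and those paths are assumptions, not taken from the linked instructions.

```shell
#!/bin/bash
# Sketch only: install_nautilus_script is a hypothetical helper name.

# Copy a script into a Nautilus scripts folder and mark it executable.
# $1 = script file
# $2 = target folder (optional; defaults to the assumed modern location)
install_nautilus_script() {
  target="${2:-$HOME/.local/share/nautilus/scripts}"
  mkdir -p "$target" || return 1
  cp -- "$1" "$target/" || return 1
  chmod +x "$target/$(basename -- "$1")"
}

# Usage:
#   install_nautilus_script ./PDFextract
```

After that, the script should show up under Scripts in the right-click menu of a selected file.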

Partial pages – PDF Shuffler

Overview

PDF-Shuffler is a small python-gtk application, which helps the user to merge or split pdf documents and rotate, crop and rearrange their pages using an interactive and intuitive graphical interface. It is a frontend for python-pyPdf.

Installation

sudo apt-get install pdfshuffler

Usage

PDF-Shuffler can crop and delete single PDF pages. You can use it to extract page ranges from a document, or even partial pages by using the cropping function.

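Page elements – Inkscape

Overview

Inkscape is a very powerful open-source vector graphics editor. It supports a wide range of formats, including PDF files. You can use it to extract, modify and save page elements from a PDF file.

Installation

sudo apt-get install inkscape

Usage

1.) Open the PDF file of your choice with Inkscape. An import dialog will appear. Choose the page you want to extract elements from. Leave the other settings as they are.

2.) In Inkscape, click and drag to select the element(s) you want to extract.

3.) Invert the selection with ! and delete the selected objects with DELETE.

4.) Crop the document to the remaining objects by opening the Document Properties dialog with CTRL + SHIFT + D and selecting "Fit document to image".

5.) Save the document as a PDF file from the File -> Save As dialog.

6.) If there are bitmap/raster images in your cropped document, you can set their DPI in the dialog that appears next.

7.) If you followed all the steps, you will have produced a true PDF file that contains only the objects you selected.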

Fourth answer

Save this as a shell script, e.g. pdfextractor.sh:

#!/bin/bash
# this function uses 3 arguments:
#     $1 is the first page of the range to extract
#     $2 is the last page of the range to extract
#     $3 is the input file
#     output file will be named "inputfile_pXX-pYY.pdf"
gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
   -dFirstPage="${1}" \
   -dLastPage="${2}" \
   -sOutputFile="${3%.pdf}_p${1}-p${2}.pdf" \
   "${3}"

To run it, type:

./pdfextractor.sh 4 20 myfile.pdf

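1) 4 is the page at which the new PDF will start.

2) 20 is the page at which the new PDF will end.

3) myfile.pdf is the PDF file you want to extract parts from.

The output will be myfile_p4-p20.pdf, in the same directory as the original PDF file.

All this and more information here: Tech Tip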
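
The script above passes its arguments to Ghostscript unchecked. Here is a minimal sketch of a guard that could sit in front of the gs call, assuming the same argument order (first page, last page, input file). The function names is_pos_int and valid_range are hypothetical and not part of the original tip.

```shell
#!/bin/bash
# Sketch only: is_pos_int and valid_range are hypothetical helper names.

# Succeed only if $1 is a positive integer without leading zeros.
is_pos_int() {
  case "$1" in
    ''|*[!0-9]*|0*) return 1 ;;
    *) return 0 ;;
  esac
}

# Succeed only if $1 and $2 are positive integers and $1 <= $2.
valid_range() {
  is_pos_int "$1" && is_pos_int "$2" && [ "$1" -le "$2" ]
}

# Usage inside pdfextractor.sh, before calling gs:
#   valid_range "$1" "$2" || { echo "usage: $0 FIRST LAST FILE.pdf" >&2; exit 1; }
```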

Fifth answer

QPDF is great. Use it this way to extract pages 1-10 from input.pdf and save them as output.pdf:

qpdf --pages input.pdf 1-10 -- input.pdf output.pdf

Note that input.pdf is written twice.
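
To spell out why the file name appears twice: the input.pdf after --pages names the file the page range is taken from, while the other one is qpdf's primary input, which supplies the document-level data. As a hedged sketch, the helper below (qpdf_range is a made-up name) wraps the call. Recent qpdf releases (8.4 and later, as far as I know) also accept . after --pages as shorthand for the primary input, but the version here repeats the name so it should work with older releases too.

```shell
#!/bin/bash
# Sketch only: qpdf_range is a hypothetical helper name.

# Extract pages $2..$3 of the PDF $1 into $4.
# The input file is named twice: once as the primary input (document-level
# data) and once after --pages (the source of the pages themselves).
qpdf_range() {
  qpdf "$1" --pages "$1" "$2-$3" -- "$4"
}

# Usage (requires qpdf):
#   qpdf_range input.pdf 1 10 output.pdf
```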

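You can install it by invoking:

sudo apt-get install qpdf

or by going to the Ubuntu Apps Directory.

It is a great tool for PDF manipulation, very fast and with few dependencies. "It can encrypt and linearize files, expose the internals of a PDF file, and do many other operations useful to end users and PDF developers."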

http://sourceforge.net/projects/qpdf/

Sixth answer

There is a command-line utility called pdfseparate.

From the docs:

pdfseparate sample.pdf sample-%d.pdf

extracts  all pages from sample.pdf, if i.e. sample.pdf has 3 pages, it
   produces

sample-1.pdf, sample-2.pdf, sample-3.pdf

Or, to select just a single page (in this case the first page) from the file sample.pdf:

pdfseparate -f 1 -l 1 sample.pdf sample-1.pdf
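
For a page range rather than a single page, the same -f and -l options can be combined with the %d pattern, and the resulting single-page files joined again with pdfunite, which ships in the same poppler-utils package. This is a rough sketch, assuming both tools are installed and the range covers at least two pages; the helper names page_files and extract_range_poppler are invented for this example.

```shell
#!/bin/bash
# Sketch only: page_files and extract_range_poppler are hypothetical names.

# Print "PREFIX-N.pdf" for N = first..last, one per line, in numeric order
# (a shell glob would sort PREFIX-10.pdf before PREFIX-2.pdf).
# $1 = prefix, $2 = first page, $3 = last page
page_files() {
  local n="$2"
  while [ "$n" -le "$3" ]; do
    printf '%s-%d.pdf\n' "$1" "$n"
    n=$((n + 1))
  done
}

# Extract pages $2..$3 of the PDF $1 into the single file $4.
extract_range_poppler() {
  local tmp status
  tmp="$(mktemp -d)" || return 1
  pdfseparate -f "$2" -l "$3" "$1" "$tmp/page-%d.pdf" || return 1
  # Word splitting of the file list is intentional here; it assumes the
  # temporary directory path contains no spaces.
  pdfunite $(page_files "$tmp/page" "$2" "$3") "$4"
  status=$?
  rm -rf "$tmp"
  return "$status"
}

# Usage (requires poppler-utils):
#   extract_range_poppler sample.pdf 2 3 sample_p2-3.pdf
```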

Seventh answer

pdftk (sudo apt-get install pdftk) is also a great command-line tool for PDF manipulation. Here are some examples of what pdftk can do:

   Collate scanned pages
     pdftk A=even.pdf B=odd.pdf shuffle A B output collated.pdf
     or if odd.pdf is in reverse order:
     pdftk A=even.pdf B=odd.pdf shuffle A Bend-1 output collated.pdf

   Join in1.pdf and in2.pdf into a new PDF, out1.pdf
     pdftk in1.pdf in2.pdf cat output out1.pdf
     or (using handles):
     pdftk A=in1.pdf B=in2.pdf cat A B output out1.pdf
     or (using wildcards):
     pdftk *.pdf cat output combined.pdf

   Remove page 13 from in1.pdf to create out1.pdf
     pdftk in1.pdf cat 1-12 14-end output out1.pdf
     or:
     pdftk A=in1.pdf cat A1-12 A14-end output out1.pdf

   Burst a single PDF document into pages and dump its data to
   doc_data.txt
     pdftk in.pdf burst

   Rotate the first PDF page to 90 degrees clockwise
     pdftk in.pdf cat 1east 2-end output out.pdf

   Rotate an entire PDF document to 180 degrees
     pdftk in.pdf cat 1-endsouth output out.pdf

In your case, I would do:

     pdftk A=input.pdf cat A<page_range> output output.pdf
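
Here <page_range> can be a single range such as 12-15, or several ranges, each prefixed with the handle A. As an illustration, the hypothetical helper below turns a comma-separated list like 1-3,7,10-end into the A-prefixed form that the cat operation expects. It is a sketch, not part of pdftk.

```shell
#!/bin/bash
# Sketch only: handle_ranges is a hypothetical helper name.

# Turn "1-3,7,10-end" into "A1-3 A7 A10-end".
# $1 = handle (e.g. A), $2 = comma-separated page ranges
handle_ranges() {
  printf '%s\n' "$2" | tr ',' '\n' | sed "s/^/$1/" | paste -sd ' ' -
}

# Usage (requires pdftk); the unquoted substitution is deliberate, so that
# each range becomes its own argument:
#   pdftk A=input.pdf cat $(handle_ranges A 1-3,7,10-end) output output.pdf
```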

Eighth answer

On any system with a TeX distribution installed:

pdfjam <input file> <page ranges> -o <output file>

For example:

pdfjam original.pdf 5-10 -o out.pdf

See https://tex.stackexchange.com/a/79626/8666

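Ninth answer

Have you tried PDF Mod?

You can, for example, extract pages and save them as a PDF.

Description:

PDF Mod is a simple tool for modifying PDF documents. It can rotate, extract, remove and reorder pages via drag and drop. Multiple documents may be combined via drag and drop. You may also edit the title, subject, author and keywords of a PDF document using PDF Mod.

I hope this will be useful.

Regards.

Tenth answer

As it turns out, I can do this with imagemagick. If you don't have it, simply install it with: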

sudo apt-get install imagemagick

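Note 1: I tried this with a one-page PDF (I am still learning to use imagemagick, so I didn't want more trouble than necessary). I don't know whether or how it works with multiple pages, but you can extract the one page of interest with pdftk:

pdftk A=myfile.pdf cat A1 output page1.pdf

where you indicate the page number to be split out (in the example above, A1 selects the first page).

Note 2: The image resulting from this procedure will be a raster.

Open the PDF with the command display, which is part of the imagemagick suite: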

display file.pdf

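Now click on the window and a menu will pop up at the side. There, select Transform | Crop.

Back in the main window, you can select the area you want to crop by simply dragging the pointer (the classic corner-to-corner selection).

Notice the hand-shaped pointer around the image while selecting.

You can refine this selection before proceeding to the next step.

Once you are done, take note of the little rectangle that appears in the upper left corner. It shows first the dimensions of the selected area (e.g. 281x218) and second the coordinates of the first corner (e.g. +256+215).

Write down the dimensions of the selected area; you will need them when saving the cropped image.

Now, back in the pop-up menu (which is now the specific "crop" menu), click the Crop button.

Finally, once you are satisfied with the result of the cropping, click on the menu File | Save.

Navigate to the folder where you want to save the cropped PDF, type a name, click the Format button, select PDF in the "Select image format type" window and click the Select button. Back in the "Browse and select a file" window, click the Save button.

Before saving, imagemagick will ask you to "Select page geometry". Here you type the dimensions of your cropped image, using a plain letter "x" to separate width and height.

Now, you can do all of this from the command line (the command is convert, with the option -crop). That is certainly faster, but you have to know the coordinates of the area you want to extract beforehand. Check man convert and an example on their webpage.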
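
As a sketch of that command-line route, the dimensions and corner coordinates noted earlier (281x218 and +256+215 in the example) combine into a single geometry argument for -crop. The helper name crop_geometry and the file names are placeholders, and the result is still a raster image wrapped in a PDF, as with the GUI steps.

```shell
#!/bin/bash
# Sketch only: crop_geometry is a hypothetical helper name.

# Join a size ("281x218") and an offset ("+256+215") into one geometry string.
crop_geometry() {
  printf '%s%s\n' "$1" "$2"
}

# Usage (requires imagemagick); +repage discards the original page offset so
# that the output page is just the cropped area:
#   convert page1.pdf -crop "$(crop_geometry 281x218 +256+215)" +repage cropped.pdf
```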
