问题描述
我正在寻找一个好的程序来向我展示两个相似的pdf文件之间的区别。特别是,我正在寻找的东西不仅会在文件的ascii版本(使用”pdftotext”)上运行diff。这就是pdfdiff.py所做的。
最佳答案
您可以为此使用DiffPDF。根据描述:
DiffPDF is used to compare two PDF files. By default the comparison is of the text on each pair of pages, but comparing the appearance of pages is also supported (for example, if a diagram is changed or a paragraph reformatted). It is also possible to c> ompare particular pages or page ranges. For example, if there are two versions of a PDF file, one with pages 1-12 and the other with pages 1-13 because of an extra page having been added as page 4, they can be compared by specifying two page ranges, 1-12 for the first and 1-3, 5-13 for the second. This will make DiffPDF compare pages in the pairs (1, 1), (2, 2), (3, 3), (4, 5), (5, 6), and so on, to (12, 13).
次佳答案
我只是想出一种使DiffPDF(@qbi建议的程序)可用于微小更改的黑客。我要做的是使用pdfjam将所有页面pdf连接成一个长滚动,然后比较滚动。即使删除或插入大块也可以使用!
这是一个执行该任务的bash脚本:
#!/bin/bash
#
# Compare two PDF files.
# Dependencies:
# - pdfinfo (xpdf)
# - pdfjam (texlive-extra-utils)
# - diffpdf
#
MAX_HEIGHT=15840 #The maximum height of a page (in points), limited by pdfjam.
TMPFILE1=$(mktemp /tmp/XXXXXX.pdf)
TMPFILE2=$(mktemp /tmp/XXXXXX.pdf)
usage="usage: scrolldiff -h FILE1.pdf FILE2.pdf
-h print this message
v0.0"
while getopts "h" OPTIONS ; do
case ${OPTIONS} in
h|-help) echo "${usage}"; exit;;
esac
done
shift $(($OPTIND - 1))
if [ -z "$1" ] || [ -z "$2" ] || [ ! -f "$1" ] || [ ! -f "$2" ]
then
echo "ERROR: input files do not exist."
echo
echo "$usage"
exit
fi
#Get the number of pages:
pages1=$( pdfinfo "$1" | grep 'Pages' - | awk '{print $2}' )
pages2=$( pdfinfo "$2" | grep 'Pages' - | awk '{print $2}' )
numpages=$pages2
if [[ $pages1 > $pages2 ]]
then
numpages=$pages1
fi
#Get the paper size:
width1=$( pdfinfo "$1" | grep 'Page size' | awk '{print $3}' )
height1=$( pdfinfo "$1" | grep 'Page size' | awk '{print $5}' )
width2=$( pdfinfo "$2" | grep 'Page size' | awk '{print $3}' )
height2=$( pdfinfo "$2" | grep 'Page size' | awk '{print $5}' )
if [ $(bc <<< "$width1 < $width2") -eq 1 ]
then
width1=$width2
fi
if [ $(bc <<< "$height1 < $height2") -eq 1 ]
then
height1=$height2
fi
height=$( echo "scale=2; $height1 * $numpages" | bc )
if [ $(bc <<< "$MAX_HEIGHT < $height") -eq 1 ]
then
height=$MAX_HEIGHT
fi
papersize="${width1}pt,${height}pt"
#Make the scrolls:
pdfj="pdfjam --nup 1x$numpages --papersize {${papersize}} --outfile"
$pdfj "$TMPFILE1" "$1"
$pdfj "$TMPFILE2" "$2"
diffpdf "$TMPFILE1" "$TMPFILE2"
rm -f $TMPFILE1 $TMPFILE2
第三种答案
即使这不能直接解决问题,这也是从命令行以很少依赖项完成所有操作的好方法:
diff <(pdftotext -layout old.pdf /dev/stdout) <(pdftotext -layout new.pdf /dev/stdout)
https://linux.die.net/man/1/pdftotext
对于基本的pdf比较,它确实非常有效。如果您拥有较新的pdftotext版本,则可以尝试使用-bbox
而不是-layout
。
就diff程序而言,我喜欢使用diffuse,因此命令更改的幅度很小:
diffuse <(pdftotext -layout old.pdf /dev/stdout) <(pdftotext -layout new.pdf /dev/stdout)
http://diffuse.sourceforge.net/
希望能有所帮助。
第四种答案
如果您有2-3个巨大的pdf(或epub或其他格式,请阅读下面)文件进行比较,则可以组合以下功能:
-
口径(将您的来源转换为文本)
-
融合(以视觉方式搜索文本文件之间的差异)
-
并行(使用所有系统核心来加速)
下面的脚本接受以下任何文件格式作为输入:MOBI,LIT,PRC,EPUB,ODT,HTML,CBR,CBZ,RTF,TXT,PDF和LRS。
如果未安装,则安装融合,口径和平行:
#install packages
sudo apt-get -y install meld calibre parallel
为了能够在您计算机的任何位置执行代码,请将以下代码保存在目录”/usr/local/bin”内名为”diffepub”的文件中(无扩展名)。
usage="
*** usage:
diffepub - compare text in two files. Valid format for input files are:
MOBI, LIT, PRC, EPUB, ODT, HTML, CBR, CBZ, RTF, TXT, PDF and LRS.
diffepub -h | FILE1 FILE2
-h print this message
Example:
diffepub my_file1.pdf my_file2.pdf
diffepub my_file1.epub my_file2.epub
v0.2 (added parallel and 3 files processing)
"
#parse command line options
while getopts "h" OPTIONS ; do
case ${OPTIONS} in
h|-help) echo "${usage}"; exit;;
esac
done
shift $(($OPTIND - 1))
#check if first 2 command line arguments are files
if [ -z "$1" ] || [ -z "$2" ] || [ ! -f "$1" ] || [ ! -f "$2" ]
then
echo "ERROR: input files do not exist."
echo
echo "$usage"
exit
fi
#create temporary files (first & last 10 characters of
# input files w/o extension)
file1=`basename "$1" | sed -r -e '
s/\..*$// #strip file extension
s/(^.{1,10}).*(.{10})/\1__\2/ #take first-last 10 chars
s/$/_XXX.txt/ #add tmp file extension
'`
TMPFILE1=$(mktemp --tmpdir "$file1")
file2=`basename "$2" | sed -r -e '
s/\..*$// #strip file extension
s/(^.{1,10}).*(.{10})/\1__\2/ #take first-last 10 chars
s/$/_XXX.txt/ #add tmp file extension
'`
TMPFILE2=$(mktemp --tmpdir "$file2")
if [ "$#" -gt 2 ]
then
file3=`basename "$3" | sed -r -e '
s/\..*$// #strip file extension
s/(^.{1,10}).*(.{10})/\1__\2/ #take first-last 10 chars
s/$/_XXX.txt/ #add tmp file extension
'`
TMPFILE3=$(mktemp --tmpdir "$file3")
fi
#convert to txt and compare using meld
doit(){ #to solve __space__ between filenames and parallel
ebook-convert $1
}
export -f doit
if [ "$#" -gt 2 ]
then
(parallel doit ::: "$1 $TMPFILE1" \
"$2 $TMPFILE2" \
"$3 $TMPFILE3" ) &&
(meld "$TMPFILE1" "$TMPFILE2" "$TMPFILE3")
else
(parallel doit ::: "$1 $TMPFILE1" \
"$2 $TMPFILE2" ) &&
(meld "$TMPFILE1" "$TMPFILE2")
fi
确保所有者是您的用户,并且具有执行权限:
sudo chown $USER:$USER /usr/local/bin/diffepub
sudo chmod 700 /usr/local/bin/diffepub
要测试它,只需键入:
diffepub FILE1 FILE2
我对其进行了测试,以比较+1600页pdf的2个修订版本,并且效果很好。由于口径是使用python编写的,以实现可移植性,因此花了10分钟将两个文件都转换为文本。缓慢但可靠。