Skip to content Skip to sidebar Skip to footer

How Can I Unittest Whether Pdf Files Have Been Generated Correctly?

I write a small python library that uses matplotlib and seaborn to draw charts, and I wonder how I can test whether the charts look like what I actually want. Thus, given a referen

Solution 1:

Some ideas:

  • Use diff-pdf
  • Convert it to an image (e.g. using ImageMagick) and use PerceptualDiff
  • Get the data out of the PDF somehow (PyPDF2 maybe?) and compare that
  • Use something (PyPDF2? pdftk?) to patch the header information (like timestamps) to the point the files are equal and compare hashes

Solution 2:

For use with regression testing, I have written diffpdf.sh to perform a page-by-page visual diff for PDFs. It makes use of ImageMagick and the Poppler PDF utilities pdftoppm and pdfinfo.

diffpdf.sh will output a non-zero return code if the PDFs do not display identically and print the page numbers for the pages that differ, along with a number that reflects how much the pages differ. A visual diff image for each page is also saved to the pdfdiff directory.

#!/bin/bash# usage: diffpdf.sh fidle_1.pdf file_2.pdf# requirements:# - ImageMagick# - Poppler's pdftoppm and pdfinfo tools (works with 0.18.4 and 0.41.0,#                                         fails with 0.42.0)

DIFFDIR="pdfdiff"# directory to place diff images in
MAXPROCS=$(getconf _NPROCESSORS_ONLN)    # number of parallel processes

pdf_file1=$1
pdf_file2=$2function diff_page {
    # based on http://stackoverflow.com/a/33673440/438249
    pdf_file1=$1
    pdf_file2=$2
    page_number=$3
    page_index=$(($page_number - 1))

    (cat$pdf_file1 | pdftoppm -f $page_number -singlefile -gray - | convert - miff:- ; \
     cat$pdf_file2 | pdftoppm -f $page_number -singlefile -gray - | convert - miff:- ) | \
    convert - \( -clone 0-1 -compose darken -composite \) \
            -channel RGB -combine $DIFFDIR/$page_number.jpg

    if (($? > 0)); thenecho"Problem running pdftoppm or convert!"exit 1
    fi
    grayscale=$(convert pdfdiff/$page_number.jpg -colorspace HSL -channel g -separate +channel -format "%[fx:mean]" info:)
    if [ "$grayscale" != "0" ]; thenecho"page $page_number ($grayscale)"return 1
    fireturn 0
}

function num_pages {
    pdf_file=$1

    pdfinfo $pdf_file | grep "Pages:" | awk '{print $2}'
}

function minimum {
    echo $(( $1 < $2 ? $1 : $2 ))
}

# guard agains accidental deletion of files in the root directoryif [ -z "$DIFFDIR" ]; thenecho"DIFFDIR needs to be set!"exit 1
fiecho"Running $MAXPROCS processes in parallel"

pdf1_num_pages=$(num_pages $pdf_file1)
pdf2_num_pages=$(num_pages $pdf_file2)

min_pages=$(minimum $pdf1_num_pages$pdf2_num_pages)

if [ "$pdf1_num_pages" -ne "$pdf2_num_pages" ]; thenecho"PDF files have different lengths ($pdf1_num_pages and $pdf2_num_pages)"
    rc=1
fiif [ -d "$DIFFDIR" ]; thenrm -f $DIFFDIR/*
elsemkdir$DIFFDIRfi# get exit status from subshells (http://stackoverflow.com/a/29535256/438249)function wait_for_processes {
    local rc=0

    while (( "$#" )); do# wait returns the exit status for the processif ! wait"$1"; then
            rc=1
        fishiftdonereturn$rc
}

functionhowmany() {
    echo$#
}

rc=0
pids=""for page_number in `seq 1 $min_pages`;
do
    diff_page $pdf_file1$pdf_file2$page_number &
    pids+=" $!"if [ $(howmany $pids) -eq "$MAXPROCS" ]; thenif ! wait_for_processes $pids; then
            rc=1
        fi
        pids=""fidoneif ! wait_for_processes $pids; then
    rc=1
fiexit$rc

EDIT: An improved version of this script, written in Python can be found here.

Post a Comment for "How Can I Unittest Whether Pdf Files Have Been Generated Correctly?"