Deskewing Scanned Notes

Introduction

The most recent version of this script is located here.

One nice way to preserve old notes is to scan them to a PDF. However, not all scanners orient the pages correctly, and the resulting files can take up a lot of space. Several tools, such as unpaper or pdfsandwich, can make scanned notes look better. It is also possible to enhance notes with mzucker’s noteshrink approach, or with ImageMagick as in lelandbatey’s whiteboard cleaner, though that last method can be quite slow even on single images.

This solution focuses mostly on black-and-white notes, but it can be changed to handle color. It can shrink a PDF to about 20% of its original size, depending on how many lines there are on the page. It can also deskew notes that unpaper cannot, and it is easy to modify.

Examples

The image on the left is the original image (JPG 210.4 KiB, PNG 680.9 KiB) and the image on the right is the deskewed image (PNG 171.9 KiB). Converting the image to PDF with img2pdf adds only 500 to 700 bytes to the overall file size and, unlike ImageMagick, does not reduce the image quality.
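The size overhead quoted above can be checked with a small helper. This is only a minimal sketch: overhead_bytes is a hypothetical function name, and page.png and page.pdf are placeholder file names for an image and the PDF that img2pdf produced from it.

```shell
#!/bin/bash
# Hypothetical helper: print how many bytes larger the second file is than the first.
overhead_bytes() {
    local before after
    before=$(wc -c < "$1")
    after=$(wc -c < "$2")
    echo $((after - before))
}

# Example (placeholder file names):
#   img2pdf --output page.pdf page.png
#   overhead_bytes page.png page.pdf
```

A result of a few hundred bytes would be consistent with the figures given here; the exact number depends on the image and the img2pdf version.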

Another example uses the dark border removal option, which is needed when the background of the scanner is black instead of white.

Like the first example, the image on the left is the original image (JPG 354.5 KiB, PNG 1.6 MiB) and the image on the right is the converted image (PNG 44.0 KiB).

Running deskew.sh

This script uses bash, mktemp (from coreutils), pdfinfo and pdftoppm (from poppler-utils), convert (from imagemagick), img2pdf, gs (from ghostscript), and exiftool (from libimage-exiftool-perl).
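Because the script depends on several external tools, it may help to confirm that they are installed before running it. The following is a minimal sketch, not part of the original script; missing_tools is a hypothetical helper name, and the tool names are the ones listed above.

```shell
#!/bin/bash
# Hypothetical helper: print the name of every listed command that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the tools named in the paragraph above and report any that are absent.
missing="$(missing_tools mktemp pdfinfo pdftoppm convert img2pdf gs exiftool)"
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing >&2
fi
```

The check only reports; whether to abort when something is missing is left to the caller.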

This tool also comes with several command line switches, along with defaults used when the switch is not present.
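To illustrate how the switches combine with their defaults, here is a small self-contained sketch of the option handling. It is an illustration rather than the script itself: resolve_opts is a hypothetical function, and the defaults (border 5%, 300 dpi, pgm) are assumed from the usage text of the script.

```shell
#!/bin/bash
# Hypothetical sketch: resolve switches against the documented defaults
# and print the settings that would be used.
resolve_opts() {
    local OPTIND=1 arg
    local border="5%" dpi=300 fmt="pgm" remove=0 out="(default)"
    while getopts "b:d:o:p:r" arg; do
        case "$arg" in
            b) border="${OPTARG}%" ;;   # threshold offset, in percent
            d) dpi="$OPTARG" ;;         # rendering resolution
            o) out="$OPTARG" ;;         # output file name
            p) fmt="$OPTARG" ;;         # netpbm format: ppm, pgm or pbm
            r) remove=1 ;;              # remove dark scanner border
            *) return 1 ;;
        esac
    done
    shift $((OPTIND - 1))
    echo "border=$border dpi=$dpi fmt=$fmt remove_border=$remove output=$out input=$1"
}

resolve_opts notes.pdf
resolve_opts -r -b 10 -d 150 -o clean.pdf scan.pdf
```

The first call shows the defaults, while the second overrides the border, resolution, and output name and turns on border removal.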

deskew.sh
#!/bin/bash
#https://stackoverflow.com/a/16496491
usage() { echo "Usage: $0 [-b 0-20] [-o outputfile] [-d 72-600] [-r] [-p ppm,pgm,pbm] inputfile
    defaults: $0 -b 5% -d 300 -p pgm inputfile
    example: for i in *.pdf; do ./deskew.sh \"\$i\"; done
    get image dpi size: pdfimages -list file.pdf" 1>&2; exit 1;}
#h = check w/o param, h: = check w/ param
while getopts "hb:d:o:p:r" arg; do
    case "${arg}" in
        b) b=${OPTARG} ;;
        d) d=${OPTARG} ;;
        o) o=${OPTARG} ;;
        p) p=${OPTARG} ;;
        r) r=1 ;;
        h) usage; exit 0 ;;
    esac
done
#remove optargs
shift $(($OPTIND-1))

inputPDF="$1"
if [[ ! -f "$inputPDF" ]]; then
    echo "File not found: $inputPDF" 1>&2
    exit 1
fi
tempPDF=temp
outputPDF="${inputPDF%.pdf}_skew.pdf"
border="5%"
if [[ ! -z "${b}" ]]; then
    border="${b}%"
fi
dpi=300 #default dpi, as documented in the usage text
if [[ ! -z "${d}" ]]; then
    dpi="${d}"
fi
#https://en.wikipedia.org/wiki/Netpbm_format
fmt="pgm"
if [[ ! -z "${p}" ]]; then
    fmt="${p}"
fi

#create temp
currentDir="$(pwd)"
tempDir="$(mktemp -d --tmpdir="$currentDir")"

#get page count
pages="$(pdfinfo "$inputPDF" | grep Pages: | awk '{print $2}')"
echo "Deskewing $inputPDF, $pages pages with dpi $dpi"

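#convert to the chosen netpbm format (pdftoppm writes ppm by default, pgm with -gray, pbm with -mono)
case "$fmt" in
    ppm) ppmFlag=() ;;
    pbm) ppmFlag=(-mono) ;;
    *)   fmt="pgm"; ppmFlag=(-gray) ;;
esac
pdftoppm "${ppmFlag[@]}" -r "$dpi" "$inputPDF" "$tempDir/$tempPDF"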

for i in "$tempDir"/temp-*."$fmt"; do
    iPDF="${i%.$fmt}.pdf"
    echo "Converting $(basename "$tempDir")/$(basename "$i") to $(basename "$iPDF")"
    #remove black border - if r exists
    if [[ ! -z "$r" ]]; then
        convert -density "$dpi" \
            -bordercolor black -border 1 \
            -fuzz 10% -fill white -draw 'color 0,0 floodfill' \
            "$i" "$i"
    fi
    #deskew, clean up image, trim blank part
    convert -density "$dpi" \
        -deskew 80% -background white \
        -lat 25x25-"$border" \
        +repage -fuzz 20% -trim +repage \
        "$i" "$i"

    #check return code if not successful
    if [ "$?" -ne 0 ]; then
        echo "Convert failed, exiting"
        rm -rf "$tempDir"
        exit 1
    fi

    #convert the cleaned image to a single-page pdf
    pgSize="$(identify -format '%wx%h' "$i")" #pixel size, currently unused
    img2pdf --title "${inputPDF%.pdf}" --imgsize "$dpi"dpi --output "$iPDF" "$i"
    #rm "$i"
done

#keep the original under the _skew name so the result can take the original file name
mv "$inputPDF" "$outputPDF"

#combine pages with ghostscript; optionally add -dPDFSETTINGS=/prepress to downsample images above 300 dpi
echo "Combining files to $inputPDF, setting title to ${inputPDF%.pdf}"
gs -sDEVICE=pdfwrite -dNOPAUSE -dQUIET -dBATCH -r"$dpi" \
    -sOutputFile="$inputPDF" "$tempDir"/temp-*.pdf
exiftool -Title="${inputPDF%.pdf}" -overwrite_original_in_place "$inputPDF"

#if -o was given, move the result there and restore the original file name
if [[ ! -z "${o}" ]]; then
    mv "$inputPDF" "${o}"
    mv "$outputPDF" "$inputPDF"
fi
#delete temp files
echo "Deleting temp files"
rm -rf "$tempDir"
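
The usage text advertises ranges for -b (0 to 20) and -d (72 to 600), but the script above does not enforce them. The following is a minimal sketch of a range check that could run right after option parsing; in_range is a hypothetical helper and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if $1 is a non-negative integer
# between $2 and $3 inclusive.
in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Example: warn about an out-of-range border before any conversion starts.
b=25
if ! in_range "$b" 0 20; then
    echo "border must be between 0 and 20" >&2
fi
```

The same helper could guard the dpi value with in_range "$d" 72 600.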