@bartoszek
Last active December 28, 2020 17:22
unscramble pdf on gcloud/wsl
#!/bin/bash
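#usage: exactly one argument, the PDF file to process
[[ $# -ne 1 ]] && { echo "usage: $(basename "$0") pdf_file" >&2; exit 10; }
#depends: install poppler-utils and tesseract (with Polish language data) if any tool is missing
hash pdftoppm pdfunite tesseract || sudo apt install -y poppler-utils tesseract-ocr{,-pol}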
#tmp
tmp=$(mktemp -d)
# single quotes defer expansion until the trap fires; inner quotes protect paths with spaces
trap 'rm -rf "$tmp"' EXIT
#pdf->png
# use only the file name for temp files, so an input given with a directory path also works
base=$(basename "${1%.pdf}")
echo "Rasterizing ..." >&2
pdftoppm -png "$1" "$tmp/$base" 2>&1
echo "OCRing ..." >&2
#png->pdf
imgs=("$tmp/$base"*.png)
i=0
for img in "${imgs[@]}"; do
  echo -en "Page: $((++i))/${#imgs[@]}\r" >&2
  tesseract -l pol --psm 1 --oem 1 "$img" "${img%.png}" pdf 1>/dev/null 2>&1
done
# finish the carriage-return progress line
echo >&2
#concat pdfs
echo "Concatenating ..." >&2
pdfunite "$tmp/$base"*.pdf "${1%.pdf}".copy.pdf 2>&1