Make PDFs screen reader friendly with OCR

Cristos L-C edited this page Jun 25, 2020 · 2 revisions

Purpose

Even when a town or city publishes its sample ballot, the PDF may lack the text layer that makes it readable by tools such as screen readers. Optical character recognition (OCR) can add that layer. You can convert such PDFs (or raw JPG, PNG, or TIFF images) to an accessible format with free command-line tools.

Prerequisites

This guide assumes you are working on a computer running macOS Mojave or later. The process is similar for Linux computers, but initial setup will be different. Tool availability on Windows computers has not been confirmed.

Initial Setup

  1. Install Homebrew by following the instructions at https://brew.sh/
  2. Launch the Terminal app to access the command line
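  3. Run the following command: brew update && brew cask install homebrew/cask-versions/adoptopenjdk8 && brew install pdftk-java ocrmypdf
    • NOTE: Current Homebrew releases have removed the brew cask install subcommand. If the command above fails for that reason, use brew install --cask in place of brew cask install.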
  4. If you want to add a "Sample" watermark overlay on your ballots, download the sample.pdf file.
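
Before moving on, it can help to confirm that both tools ended up on your PATH. The following is a minimal sketch under that assumption; check_tools is a hypothetical helper name, not something provided by Homebrew or by this wiki.

```shell
#!/usr/bin/env bash
# check_tools is a hypothetical helper: it reports every named tool
# that cannot be found on PATH and returns non-zero if any is missing.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Paste the function into Terminal, then run check_tools ocrmypdf pdftk && echo "ready". Any tool reported as missing points back to step 3.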

Converting PDFs Manually

  1. For each PDF you want to convert, run the following command in Terminal: ocrmypdf --output-type pdf --sidecar --remove-background --deskew --remove-vectors "ORIGINALFILE.pdf" "DESTINATIONFILE.pdf"
  2. To add a "Sample" watermark overlaid on the ballot, run the following command: pdftk "DESTINATIONFILE.pdf" stamp sample.pdf output "DESTINATIONFILE_WATERMARKED.pdf"
    • NOTE: You cannot use the same filename as both the input and output for the watermark command. You must make sure that the filename after output is different from the filename after pdftk in the above command.
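
The filename restriction in the note above is easy to trip over, so one option is to wrap the watermark command in a small guard. This is only a sketch: stamp_pdf is a hypothetical function name, and the check covers identical path strings or paths that already resolve to the same existing file.

```shell
#!/usr/bin/env bash
# stamp_pdf is a hypothetical wrapper around the pdftk stamp command.
# It refuses to run when the input and output name the same file.
stamp_pdf() {
  local input="$1" stamp="$2" output="$3"
  if [ "$input" = "$output" ] || [ "$input" -ef "$output" ]; then
    echo "error: input and output must be different files" >&2
    return 1
  fi
  pdftk "$input" stamp "$stamp" output "$output"
}
```

For example, stamp_pdf "DESTINATIONFILE.pdf" sample.pdf "DESTINATIONFILE_WATERMARKED.pdf" runs the same pdftk command as step 2, while passing the same name twice stops with an error before pdftk is invoked.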

Converting and Watermarking PDFs Automatically

To simplify the process, you can use a bash script that performs both OCR conversion and watermarking for you.

  1. Download ocr-and-watermark.sh
  2. In Terminal, change to the folder that contains the downloaded file and type chmod +x ocr-and-watermark.sh. (sudo is normally not needed for a file you downloaded yourself.)
  3. To use it, type in Terminal: ./ocr-and-watermark.sh "SOURCEFILE.pdf" "DESTINATIONFILE.pdf"
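
For readers who want to see how the two manual steps could be chained, here is a hypothetical sketch. It is not the contents of ocr-and-watermark.sh, which may work differently. The function name ocr_and_watermark, the optional third argument, and the default stamp file sample.pdf in the current folder are all assumptions made for illustration. The --sidecar flag is left out because the intermediate file lives in a temporary folder that is deleted afterwards.

```shell
#!/usr/bin/env bash
# Hypothetical sketch only; NOT the real ocr-and-watermark.sh.
# Usage: ocr_and_watermark SOURCE.pdf DEST.pdf [STAMP.pdf]
ocr_and_watermark() {
  if [ "$#" -lt 2 ]; then
    echo "usage: ocr_and_watermark SOURCE.pdf DEST.pdf [STAMP.pdf]" >&2
    return 2
  fi
  local src="$1" dest="$2" stamp="${3:-sample.pdf}"
  local tmp_dir status=0
  tmp_dir="$(mktemp -d "${TMPDIR:-/tmp}/ocr.XXXXXX")" || return 1

  # Step 1: OCR into a temporary file (flags from the manual command,
  # minus --sidecar). Step 2: stamp that file into the destination, so
  # the pdftk input and output are always different files.
  ocrmypdf --output-type pdf --remove-background --deskew --remove-vectors \
      "$src" "$tmp_dir/ocr.pdf" &&
    pdftk "$tmp_dir/ocr.pdf" stamp "$stamp" output "$dest" || status=$?

  rm -rf "$tmp_dir"   # clean up the intermediate file either way
  return "$status"
}
```

Writing the OCR result to a temporary file sidesteps the rule that the watermark command cannot read and write the same filename.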

Further Improvements

  • If you have a large number of files to convert, you can use a bash script to loop through all the files and perform the OCR and/or watermarking operations.
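
A batch loop along those lines might look like the sketch below. The function name batch_convert, the CONVERT variable, and the _ocr filename suffix are assumptions made for illustration. By default it calls ./ocr-and-watermark.sh from the current folder, and it only matches files ending in lowercase .pdf.

```shell
#!/usr/bin/env bash
# batch_convert is a hypothetical helper: it runs a converter command on
# every .pdf in SRC_DIR and writes the results into OUT_DIR.
# Set CONVERT to use a command other than ./ocr-and-watermark.sh.
batch_convert() {
  local src_dir="$1" out_dir="$2"
  local convert="${CONVERT:-./ocr-and-watermark.sh}"
  local file name failed=0
  mkdir -p "$out_dir" || return 1
  for file in "$src_dir"/*.pdf; do
    [ -e "$file" ] || continue          # the glob matched nothing
    name="$(basename "$file" .pdf)"
    if ! "$convert" "$file" "$out_dir/${name}_ocr.pdf"; then
      echo "failed: $file" >&2          # keep going, report at the end
      failed=1
    fi
  done
  return "$failed"
}
```

For example, batch_convert ballots converted processes every PDF in a folder named ballots and leaves the originals untouched. Quoting each variable keeps filenames with spaces intact.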