This program converts one or more PDFs to easily readable plain text. Unlike tools such as Poppler's pdftotext, it does more than extract the text: it also improves readability by removing unnecessary newlines, spaces, headers, and footers.
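To give a feel for the kind of cleanup described above, here is a minimal sketch of joining hard-wrapped lines and collapsing extra spaces. It is an illustration only and not taken from pdfParser.py: the function name `tidy_text` and the regular expressions are assumptions, and header and footer removal is not covered.

```python
import re


def tidy_text(raw: str) -> str:
    """Join hard-wrapped lines and collapse repeated whitespace.

    Hypothetical helper for illustration; not pdfParser's actual code.
    """
    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    cleaned = []
    for paragraph in paragraphs:
        # Rejoin words split by a hyphen at a line break ("para-\ngraph").
        # This is a heuristic: it also merges genuinely hyphenated words.
        paragraph = re.sub(r"(\w)-\n(\w)", r"\1\2", paragraph)
        # Collapse remaining whitespace runs, including single newlines.
        paragraph = re.sub(r"\s+", " ", paragraph).strip()
        if paragraph:
            cleaned.append(paragraph)
    return "\n\n".join(cleaned)


sample = (
    "This is a para-\ngraph that was   wrapped\nby the PDF layout."
    "\n\n\nSecond  paragraph."
)
print(tidy_text(sample))
```

The sketch keeps paragraph breaks and discards only the line breaks that come from the PDF layout, which is the readability gain the description refers to.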
NOTE: This program has been designed with Python 3.10 and later in mind.
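If you are unsure which interpreter you are using, a quick check like the following can confirm it meets that requirement. This is a hedged convenience sketch, not part of the repository; `MIN_VERSION` and `python_is_supported` are names chosen here for illustration.

```python
import sys

# Minimum version stated in the note above.
MIN_VERSION = (3, 10)


def python_is_supported(version=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version is new enough."""
    return tuple(version[:2]) >= MIN_VERSION


if not python_is_supported():
    print(
        f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ is recommended; "
        f"found {sys.version_info.major}.{sys.version_info.minor}."
    )
```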
Download this repository:
git clone https://github.com/MaxAFriedrich/pdfParser
cd pdfParser
Then run the program, providing files as arguments.
python pdfParser.py /location/of/pdf/filename.pdf
It may be useful to alias this program so you can run it from other locations in your environment.
- Automatic built-in OCR scanning
- Remove tables and diagrams
- Output options
- Convert the PDF to Markdown
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.