pdfParser

This program converts one or more PDFs to easily readable plain text. Unlike tools such as Poppler's pdftotext, it does not stop at extracting the text: it also improves readability by removing unnecessary newlines, spaces, headers, and footers.
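To make the cleanup idea concrete, here is a minimal sketch of how that kind of post-processing could look. It is not taken from pdfParser.py; the function names (`strip_repeated_lines`, `reflow`, `clean_pages`) and the 60% threshold are assumptions chosen for illustration, and the input is assumed to be a list of already-extracted page strings.

```python
import re
from collections import Counter


def _key(line):
    # Replace digit runs so that "Page 1" and "Page 2" compare as equal.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that show up as the first or last line of most pages."""
    edge_counts = Counter()
    for page in pages:
        lines = [ln for ln in page.splitlines() if ln.strip()]
        if lines:
            # A set, so a one-line page is only counted once.
            edge_counts.update({_key(lines[0]), _key(lines[-1])})
    # Require at least two pages to agree before calling a line repeated.
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {k for k, n in edge_counts.items() if n >= threshold}
    # Simplification: a matching line is dropped wherever it occurs on a page.
    return [
        "\n".join(ln for ln in page.splitlines() if _key(ln) not in repeated)
        for page in pages
    ]


def reflow(text):
    """Join hard-wrapped lines into paragraphs and collapse extra spaces."""
    paragraphs = []
    for para in re.split(r"\n\s*\n", text):
        # Every whitespace run, single newlines included, becomes one space.
        para = re.sub(r"\s+", " ", para).strip()
        if para:
            paragraphs.append(para)
    return "\n\n".join(paragraphs)


def clean_pages(pages):
    # Simplification: pages are joined with a blank line, so a paragraph
    # that continues across a page break ends up split in two.
    return reflow("\n\n".join(strip_repeated_lines(pages)))
```

Given three pages that each start with the same title line and end with "Page 1", "Page 2", "Page 3", `clean_pages` would drop those lines and return the body text as blank-line-separated paragraphs. Detecting headers and footers by repetition is only one possible heuristic; the program itself may work differently.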

Quick start

NOTE: This program is designed for Python 3.10 and later.

Download this repository:

git clone https://github.com/MaxAFriedrich/pdfParser
cd pdfParser

Then run the program, passing one or more PDF files as arguments:

python pdfParser.py /location/of/pdf/filename.pdf

It may be useful to alias this program so you can run it from other locations in your environment.
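As an alternative to a shell alias, a small Python wrapper can do the same job. The sketch below is only an illustration: the clone location in `SCRIPT` and the helper names `build_command` and `run` are assumptions, not part of this repository, and it assumes pdfParser.py does not depend on being started from its own directory.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: change this to wherever you cloned the repository.
SCRIPT = Path.home() / "pdfParser" / "pdfParser.py"


def build_command(pdf_paths, script=SCRIPT):
    """Return the argument list that runs pdfParser.py on the given PDFs."""
    # Absolute paths keep the call independent of the current directory.
    pdfs = [str(Path(p).expanduser().resolve()) for p in pdf_paths]
    # sys.executable reuses the interpreter that is running this wrapper.
    return [sys.executable, str(script), *pdfs]


def run(pdf_paths, script=SCRIPT):
    """Run pdfParser.py and return its exit code."""
    return subprocess.call(build_command(pdf_paths, script))
```

Saved under a name of your choice somewhere on your path, a final line such as `sys.exit(run(sys.argv[1:]))` would forward whatever files you pass on the command line.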

TODO

  • Automatic built-in OCR scanning
  • Remove tables and diagrams
  • Output options
  • Convert the PDF to Markdown

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
