pdfParser

This program converts one or more PDFs to easily readable plain text. Unlike tools such as Poppler's pdftotext, it does more than extract the text: it also improves readability by removing unnecessary newlines, spaces, headers, and footers.
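To give a feel for the kind of cleanup described above, here is a minimal sketch. It is not the code in pdfParser.py; the function name clean_pages and the heuristics (a blank line marks a paragraph break, and a line that opens or closes most pages is a header or footer) are assumptions made for illustration only.

```python
import re
from collections import Counter


def clean_pages(pages: list[str]) -> str:
    """Join raw page texts into readable paragraphs (illustrative heuristics only)."""
    # Count how often each line appears as the first or last non-empty line of a page.
    edge_counts = Counter()
    for page in pages:
        lines = [ln.strip() for ln in page.splitlines() if ln.strip()]
        if lines:
            edge_counts.update({lines[0], lines[-1]})

    # Assumption: a line sitting at the edge of more than half the pages
    # (and at least two of them) is a running header or footer.
    threshold = max(2, len(pages) // 2 + 1)
    repeated = {ln for ln, n in edge_counts.items() if n >= threshold}

    paragraphs = []
    for page in pages:
        # A blank line is treated as a paragraph break; single newlines are soft wraps.
        for block in re.split(r"\n\s*\n", page):
            lines = [ln.strip() for ln in block.splitlines()]
            # Drop empty lines and any line identical to a detected header or footer.
            lines = [ln for ln in lines if ln and ln not in repeated]
            if lines:
                # Join the wrapped lines and collapse runs of whitespace.
                paragraphs.append(re.sub(r"\s+", " ", " ".join(lines)))
    return "\n\n".join(paragraphs)
```

This sketch only catches headers and footers that are identical on every page, so a footer such as a changing page number would need a looser match.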

Quick start

NOTE: This program was designed for Python 3.10 and later.

Download this repository:

git clone https://github.com/MaxAFriedrich/pdfParser
cd pdfParser

Then run the program, providing files as arguments.

python pdfParser.py /location/of/pdf/filename.pdf
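Because the program accepts files as arguments, a whole folder of PDFs can be passed in one call. The helper below is a hypothetical convenience, not part of this repository: build_command and its parameters are made-up names, and it assumes pdfParser.py accepts several paths in a single invocation.

```python
import sys
from pathlib import Path


def build_command(pdf_dir: str, script: str = "pdfParser.py") -> list[str]:
    """Build an argument list that passes every PDF in pdf_dir to the script."""
    # Sort the paths so the files are processed in a predictable order.
    pdfs = sorted(str(p) for p in Path(pdf_dir).glob("*.pdf"))
    # sys.executable is the interpreter running this helper.
    return [sys.executable, script, *pdfs]
```

The resulting list can be handed to subprocess.run(build_command("/location/of/pdf"), check=True) from inside the pdfParser directory.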

It may be useful to alias this program so that you can run it from other locations in your environment.

TODO

Automatic built-in OCR scanning
Remove tables and diagrams
Output options
Convert the PDF to Markdown

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
