# pdf_table_to_csv

Captures tables in a PDF and outputs them as a CSV file.

## This project uses

- tesseract-ocr
- tabula
- pdftools
- ghostscript

## What does it do

Takes a file in the files folder and tries to process it. The process.sh script takes an S3 URL as input, which points to a PDF. Image-only PDFs are converted to images, run through Tesseract OCR, and then converted back into a PDF with an image layer and a text layer. PDFs with a text layer are run through tabula-java, which guesses where the tables are and converts them to a CSV file.
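The flow described above could be sketched roughly as follows. This is an illustration only, not the contents of process.sh: the function names, the `TABULA_JAR` location, the 300 dpi resolution, and the use of Ghostscript's `txtwrite` device to detect a text layer are all assumptions, and pdftools is not used here.

```shell
# Hypothetical sketch of the described flow; NOT the repository's process.sh.
# TABULA_JAR is an assumed location for the tabula-java jar.
TABULA_JAR="${TABULA_JAR:-/opt/tabula.jar}"

# Print "text" if Ghostscript extracts any non-whitespace text, else "image".
pdf_kind() {
    chars=$(gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=- "$1" 2>/dev/null \
        | tr -d '[:space:]' | wc -c | tr -d ' ')
    if [ "${chars:-0}" -gt 0 ]; then echo text; else echo image; fi
}

# Rasterise an image-only PDF, OCR each page, and merge the pages back into one PDF.
ocr_pdf() {
    src=$1
    dst=$2
    work=$(mktemp -d)
    gs -q -dNOPAUSE -dBATCH -sDEVICE=png16m -r300 \
        -sOutputFile="$work/page-%03d.png" "$src"
    for png in "$work"/page-*.png; do
        # The "pdf" config makes tesseract write <outputbase>.pdf (image plus text layer).
        tesseract "$png" "${png%.png}" pdf
    done
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$dst" "$work"/page-*.pdf
    rm -rf "$work"
}

# Let tabula-java guess the table areas on every page and write them as CSV.
tables_to_csv() {
    java -jar "$TABULA_JAR" --guess --pages all --format CSV --outfile "$2" "$1"
}

# Route a PDF through OCR if needed, then extract its tables next to the input file.
convert() {
    pdf=$1
    csv="${pdf%.pdf}.csv"
    if [ "$(pdf_kind "$pdf")" = image ]; then
        ocr_pdf "$pdf" "${pdf%.pdf}.ocr.pdf"
        pdf="${pdf%.pdf}.ocr.pdf"
    fi
    tables_to_csv "$pdf" "$csv"
}
```

Calling `convert /files/example.pdf` would, under these assumptions, write `/files/example.csv`. How well the table guessing works depends on the OCR quality and the page layout.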

## How to use pdf_table_to_csv

### Build the container

    docker build . --tag pdf_table_to_csv

### To Run

Put any files you want to process in a ./files folder.

    mkdir -p ./files
    # Copy PDF files to the files folder
    docker run --rm -v ${PWD}/files:/files -u $(id -u ${USER}):$(id -g ${USER}) valveless/pdf_table_to_csv example.pdf
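To process several PDFs in one go, the run command above can be wrapped in a loop. The sketch below is a hypothetical helper, not part of the repository: `process_all` and the `IMAGE` variable are made-up names, and it assumes the container accepts a file name relative to `/files`, as in the example above.

```shell
# Hypothetical helper: run the container once for every PDF in a folder.
# IMAGE defaults to the image name used in the run example above.
IMAGE="${IMAGE:-valveless/pdf_table_to_csv}"

process_all() {
    dir=${1:-./files}
    for f in "$dir"/*.pdf; do
        # Skip the literal pattern when the folder contains no PDFs.
        [ -e "$f" ] || continue
        docker run --rm \
            -v "$(cd "$dir" && pwd)":/files \
            -u "$(id -u)":"$(id -g)" \
            "$IMAGE" "$(basename "$f")"
    done
}
```

Running `process_all ./files` starts one container per PDF. If you built the image locally with the tag from the build step, set `IMAGE=pdf_table_to_csv` first.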
