Pdf2info is a simple library for extracting tables from PDFs using tabula.
Choose one of the following:
Using local python3 interpreter:
$ python3 -m pip install -r requirements.txt
Creating a new virtual environment:
Install virtualenv if you do not already have it, then create and activate the environment:
$ python3 -m pip install virtualenv
$ virtualenv venv
$ source venv/bin/activate
(venv) $ python3 -m pip install -r requirements.txt
Creating a conda environment (assuming you have conda/miniconda installed):
$ conda create --name venv --file requirements.txt
Place all the PDFs you need in a single directory, then call the tables_from_dir.py script:
$ python tables_from_dir.py --dir=path/to/your/dir --out=path/to/out/folder
This creates one CSV file per table.
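The "one file per table" behaviour can be sketched roughly as follows. This is an illustration only, not the project's actual code: the function names and the `<pdf name>_table_<n>.csv` naming scheme are hypothetical, and the extraction step assumes the tabula-py package (which needs Java) is the tabula binding in use.

```python
import csv
from pathlib import Path


def write_tables(tables, pdf_path, out_dir):
    """Write each table (a list of rows) to its own CSV file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    written = []
    for index, rows in enumerate(tables, start=1):
        # Hypothetical naming scheme: <pdf name>_table_<n>.csv
        target = out_dir / f"{stem}_table_{index}.csv"
        with target.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        written.append(target)
    return written


def extract_tables(pdf_path):
    """Read every table in a PDF with tabula-py (assumed binding, requires Java)."""
    import tabula  # imported lazily so write_tables works without tabula installed

    frames = tabula.read_pdf(str(pdf_path), pages="all", multiple_tables=True)
    # Turn each DataFrame into a header row followed by its data rows.
    return [[list(frame.columns)] + frame.values.tolist() for frame in frames]
```

With these helpers, `write_tables(extract_tables("report.pdf"), "report.pdf", "out")` would produce one CSV in `out` for each table found in `report.pdf`.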
Log output is written to the results.log file. To also see the log live in the console, add the --log-console parameter:
$ python tables_from_dir.py --dir=path/to/your/dir --out=path/to/out/folder --log-console
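For readers curious how a flag like --log-console is typically wired up, here is a minimal sketch using only the standard library. It is an assumption about the general approach, not a copy of the script: the logger name `pdf2info`, the function names, and the log format are all hypothetical.

```python
import argparse
import logging


def build_parser():
    """Command-line options mirroring the flags shown above."""
    parser = argparse.ArgumentParser(description="Extract tables from PDFs")
    parser.add_argument("--dir", required=True, help="folder containing the PDFs")
    parser.add_argument("--out", required=True, help="folder for the CSV output")
    parser.add_argument("--log-console", action="store_true",
                        help="also print log messages to the console")
    return parser


def configure_logging(log_file="results.log", log_console=False):
    """Always log to a file; add a console handler only when requested."""
    logger = logging.getLogger("pdf2info")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handlers = [logging.FileHandler(log_file, encoding="utf-8")]
    if log_console:
        handlers.append(logging.StreamHandler())
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

A script would then call `args = build_parser().parse_args()` followed by `configure_logging(log_console=args.log_console)`, so the file log is always kept and the console output is opt-in.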
If you only need to extract tables from one PDF, use tables_from_file.py instead of tables_from_dir.py.
See the instructions in analysis/tab2know_tests.
See the instructions in analysis/pdf2info_tests.