Parse complicated two tables per page PDF #340

Open
ocefpaf opened this issue Nov 26, 2019 · 1 comment

ocefpaf commented Nov 26, 2019

This PDF has a complex layout with two tables on a single page (page 2). So far the best result comes from setting the algorithm to `header-position`, but it seems it still needs to be extended to accommodate the odd table format.

import rows


tables = rows.import_from_pdf(
    "Ibama.pdf",
    page_numbers=[2],
    algorithm="header-position", # `rects-boundaries` does not work and `y-groups` mixes header with entries
    backend="pymupdf",  # `pymupdf` yields the best results
)

field_names = tables[0]._asdict().keys()  # field names of the first parsed row

dict_keys(
    [
        'praiadocarroquebrado',
        'barradesantoantonio',
        'field_2019_09_18',
        'al',
        'field_09203008s_35265532w_2019_10_21',
        'oleada_manchas',
        'barradoriocamaratuba',
        'mataraca',
        'field_2019_09_07',
        'pb',
        'field_06353346s_34575812w_2019_10_04',
        'oleo_naoobservado',
        'name',
        'municipio',
        'data_avist_estado_latitude',
        'longitude',
        'data_revis_status',
        'praiadocabobranco',
        'joaopessoa',
        'field_2019_09_01',
        'pb_2',
        'field_07084334s_34483384w_2019_10_01',
        'oleo_naoobservado_2'
    ]
)
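To make the problem in that output concrete, here is a minimal sketch that flags the keys that are clearly data cells and not column headers. It assumes that a slug starting with `field_` followed by a digit was generated from a cell whose text began with a digit (a date or a coordinate here). The helper name `data_like_names` is hypothetical and is not part of `rows`.

```python
import re

# Field names copied from the output above.
field_names = [
    "praiadocarroquebrado", "barradesantoantonio", "field_2019_09_18", "al",
    "field_09203008s_35265532w_2019_10_21", "oleada_manchas",
    "barradoriocamaratuba", "mataraca", "field_2019_09_07", "pb",
    "field_06353346s_34575812w_2019_10_04", "oleo_naoobservado",
    "name", "municipio", "data_avist_estado_latitude", "longitude",
    "data_revis_status", "praiadocabobranco", "joaopessoa",
    "field_2019_09_01", "pb_2", "field_07084334s_34483384w_2019_10_01",
    "oleo_naoobservado_2",
]

# Assumption: "field_" followed by a digit marks a slug built from a data
# cell that started with a digit, not from a real header cell.
DATA_LIKE = re.compile(r"^field_\d")


def data_like_names(names):
    """Return the names that look like slugified dates or coordinates."""
    return [name for name in names if DATA_LIKE.match(name)]


flagged = data_like_names(field_names)
print(f"{len(flagged)} of {len(field_names)} field names look like data values")
for name in flagged:
    print(" ", name)
```

This only catches the values that start with a digit. Place names such as `praiadocarroquebrado` are data too, so the real header (`name`, `municipio`, ...) is buried even deeper than this check suggests.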

I'm kind of jealous of R for the first time, because this operation is a one-liner with tabulizer ;-p

I'll look into extending `header-position`, but if that is something that should always be handled on the user side, feel free to just close this issue.

turicas (Owner) commented Nov 27, 2019

Note: try with tabula-py: https://github.com/ocefpaf/oilmap
