Parse complicated two tables per page PDF #340

ocefpaf · 2019-11-26T16:37:41Z

Page 2 of this PDF contains two tables in a complex layout. Right now the best result comes from setting the algorithm to header-position, but it seems the algorithm still needs to be extended to accommodate the odd table format.

import rows


tables = rows.import_from_pdf(
    "Ibama.pdf",
    page_numbers=[2],
    algorithm="header-position",  # `rects-boundaries` does not work and `y-groups` mixes header with entries
    backend="pymupdf",  # `pymupdf` yields the best results
)

keys = tables[0]._asdict().keys()  # field names detected for the first row

dict_keys(
    [
        'praiadocarroquebrado',
        'barradesantoantonio',
        'field_2019_09_18',
        'al',
        'field_09203008s_35265532w_2019_10_21',
        'oleada_manchas',
        'barradoriocamaratuba',
        'mataraca',
        'field_2019_09_07',
        'pb',
        'field_06353346s_34575812w_2019_10_04',
        'oleo_naoobservado',
        'name',
        'municipio',
        'data_avist_estado_latitude',
        'longitude',
        'data_revis_status',
        'praiadocabobranco',
        'joaopessoa',
        'field_2019_09_01',
        'pb_2',
        'field_07084334s_34483384w_2019_10_01',
        'oleo_naoobservado_2'
    ]
)

I'm kind of jealous of R for the first time b/c this operation is a 1-liner with tabulizer ;-p

I'll look into extending header-position, but if that is something that should always be handled on the user side, feel free to just close this issue.
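One possible user-side workaround, sketched under assumptions: the keys above suggest that data rows sitting above the real header (`name`, `municipio`, ...) were folded into the field names. If the raw rows can be obtained as plain sequences of cell strings, they could be split wherever a known header row appears. `split_at_header` and the two-column sample below are hypothetical illustrations (cell values abbreviated from the output above), not part of the `rows` API.

```python
def split_at_header(raw_rows, header_cells):
    """Split rows into chunks, starting a new chunk at every header row.

    Rows found before the first header are kept as a leading chunk with
    ``header=None`` (for example, the tail of a table without its own header).
    """
    header_cells = set(header_cells)
    chunks = [{"header": None, "rows": []}]
    for row in raw_rows:
        if header_cells.issubset(row):
            # This row contains all expected header labels: start a new table.
            chunks.append({"header": list(row), "rows": []})
        else:
            chunks[-1]["rows"].append(list(row))
    # Drop the leading chunk if no rows preceded the first header.
    return [c for c in chunks if c["header"] is not None or c["rows"]]


# Hypothetical two-column sample, abbreviated from the keys shown above.
sample = [
    ("praiadocarroquebrado", "barradesantoantonio"),
    ("barradoriocamaratuba", "mataraca"),
    ("name", "municipio"),
    ("praiadocabobranco", "joaopessoa"),
]
chunks = split_at_header(sample, header_cells={"name", "municipio"})
```

Matching on a set of expected labels rather than on an exact row keeps the sketch tolerant of extra or merged header cells, at the cost of requiring the labels to be known in advance.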


turicas · 2019-11-27T22:43:43Z

Note: try with tabula-py: https://github.com/ocefpaf/oilmap
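For comparison, a minimal sketch of what the tabula-py attempt might look like, assuming tabula-py and a Java runtime are installed. The wrapper name `read_tables_with_tabula` is hypothetical; `tabula.read_pdf` with `pages` and `multiple_tables` is the library call. Whether it separates the two tables on this particular page cleanly has not been verified here.

```python
def read_tables_with_tabula(pdf_path, page):
    """Return the tables tabula-py detects on one page as a list of DataFrames."""
    # Imported lazily so that defining this helper does not require
    # tabula-py (or Java) to be installed.
    import tabula

    # multiple_tables=True asks tabula to return each detected table separately.
    return tabula.read_pdf(pdf_path, pages=page, multiple_tables=True)
```

Calling `read_tables_with_tabula("Ibama.pdf", page=2)` should yield one DataFrame per detected table, which can then be inspected individually.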
