
EU Tax Observatory

Usage

Installing

To install the library in a dedicated virtual environment:

python3 -m venv venv
source venv/bin/activate
python3 -m pip install git+https://github.com/dataforgoodfr/12_taxobservatory.git

Running the PDF downloader

To run the report downloader from the command line, you can invoke the pdf_downloader module:

python3 -m collecte.pdf_downloader company_names.csv

In addition, several optional parameters can be tuned. To see how to use them, check the help message:

python3 -m collecte.pdf_downloader --help

A more complete example could be:

python3 -m collecte.pdf_downloader company_names.csv --search_keywords "tax country by country reporting GRI 207-4" --dest_dirpath try_pdf_downloads --url_cache_filepath pdf_url_cache.pkl --fetch_timeout_s 60 --debug

Running this module requires a Google JSON API key as well as a search engine ID (CX code). Both must be specified in the .env file (a sample file is provided as .env.example):

# Required for fetching URLs with the Google JSON API
GOOGLE_API_KEY=CHANGE_ME
GOOGLE_CX=CHANGE_ME
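
For reference, these credentials are used to query the Google Custom Search JSON API. A minimal sketch of such a query is shown below; it illustrates the underlying API call, not the downloader's actual code, and the company query string is made up:

import os

import requests

params = {
    "key": os.environ["GOOGLE_API_KEY"],
    "cx": os.environ["GOOGLE_CX"],
    "q": "Acme Corp tax country by country reporting GRI 207-4 filetype:pdf",
}
# The Custom Search JSON API returns up to 10 results per request under "items"
response = requests.get("https://www.googleapis.com/customsearch/v1", params=params, timeout=60)
for item in response.json().get("items", []):
    print(item["title"], item["link"])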

If the pipeline runs successfully, the results folder should contain the following elements (a snippet for inspecting the CSV outputs follows the list):

  • a collection of company-named folders, each containing one or multiple PDFs
  • a log file run_pdf_downloader_DD_MM_YYYY_hh_mm_ss.log storing all runtime log entries
  • a CSV file download_data.csv listing all the downloaded company reports and their URLs
  • a CSV file missing_data.csv listing all the missing company reports and their URLs (if some were found), plus the type of error that prevented their download
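
The two CSV files can be inspected quickly with pandas, for example as below. The folder name follows your --dest_dirpath choice and the column names are whatever the downloader writes, so treat this snippet as a sketch:

import pandas as pd

# Paths assume the example --dest_dirpath used above
downloaded = pd.read_csv("try_pdf_downloads/download_data.csv")
missing = pd.read_csv("try_pdf_downloads/missing_data.csv")

print(f"{len(downloaded)} reports downloaded, {len(missing)} reports missing")
print(missing.head())  # peek at the error types reported for the missing reports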

A collection of company names is provided for convenience at test/data/company_names.csv.

Running the Streamlit app

To use the streamlined version of the extractor, start the Streamlit app locally by running:

streamlit run app/index.py

The app comes with a default configuration for page detection and parsing, but you can change it by providing a YAML file following the config.yaml format described below.

Below is an example of the pipeline running on one of the reports, parsing the tables with LlamaParse and Unstructured.

PipelineDemonstration.webm

Running the pipeline from the command line

Once installed, you can run the pipeline from the command line by invoking the country_by_country module on a PDF file:

python3 -m country_by_country config.yaml report.pdf

The YAML file describes the pipeline you want to execute. For now, you can specify the page filter and the table extraction algorithms. An example config.yaml file is given below:

config.yaml

pagefilter:
  type: RFClassifier
  params: 
    modelfile: random_forest_model_low_false_positive.joblib

table_extraction:
  img:
    - type: Camelot
      params:
        flavor: stream
    - type: Unstructured
      params:
        pdf_image_dpi: 300
        hi_res_model_name: "yolox"

table_cleaning:
  - type: LLM
    params:
      openai_model: "gpt-4-turbo-preview"

This config file uses:

  • a pretrained random forest for selecting the pages of the report that possibly contain a CbCR table
  • camelot with its stream flavor and unstructured with yolox as the table detector for locating and parsing the tables on the previously selected pages
  • LangChain with GPT-4-turbo-preview for querying the parsed tables to extract and re-order the relevant information

Available blocks

Page filter

A page filter takes a PDF filepath as input and fills in the assets under the key pagefilter (a sketch of this structure follows the list):

  • src_pdf: the path to the original PDF
  • selected_pages: the list of indices of the selected pages; the indices are 0-based
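
As a sketch, the structure described above looks roughly like this (the values are made up; only the keys are documented here):

assets = {
    "pagefilter": {
        "src_pdf": "report.pdf",         # path to the original PDF
        "selected_pages": [12, 13, 14],  # 0-based indices of the selected pages
    }
}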

The available filters are:

Copy as is

This filter does not perform any selection on the input document; it simply copies the whole content as is.

From filename

This filter reads the pages to extract from the input filename, either as a single page number or as a page range (a small parsing sketch follows the list). Valid names are given below:

  • arbitrarily_long_and_cumBerSOME_prefix_PAGENUMBER.pdf : gets the page numbered PAGENUMBER
  • arbitrarily_long_and_cumBerSOME_prefix_PAGENUMBER1-PAGENUMBER2.pdf : gets the range [PAGENUMBER1, PAGENUMBER2]
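
For illustration, the sketch below shows one way such a filename suffix could be parsed; it is not the project's actual parser, just an example of the convention described above:

import re

def pages_from_filename(filename: str) -> list[int]:
    # Match a trailing "_N.pdf" or "_N1-N2.pdf" suffix
    match = re.search(r"_(\d+)(?:-(\d+))?\.pdf$", filename)
    if match is None:
        raise ValueError(f"No page number found in {filename!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return list(range(start, end + 1))

print(pages_from_filename("arbitrarily_long_and_cumBerSOME_prefix_12-15.pdf"))  # [12, 13, 14, 15]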

RF Classifier

This filter uses a random forest trained to identify relevant pages from their text content. Several features are used, such as the ones below (a sketch of this kind of feature extraction follows the list):

  • the number of country names listed in the page
  • the presence of keywords such as "tax", "countr", "report", "cbc", etc.
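
The sketch below illustrates the kind of features listed above. The feature names, the keyword list and the truncated country list are assumptions for illustration, not the classifier's actual inputs:

KEYWORDS = ["tax", "countr", "report", "cbc"]
COUNTRIES = ["france", "germany", "ireland", "luxembourg"]  # truncated list, for illustration only

def page_features(page_text: str) -> dict:
    text = page_text.lower()
    return {
        "n_country_names": sum(text.count(country) for country in COUNTRIES),
        **{f"has_{keyword}": int(keyword in text) for keyword in KEYWORDS},
    }

print(page_features("Income tax paid in France, Germany and Ireland (CbCR, GRI 207-4)"))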

Table extraction

We allow multiple table extraction algorithms to be used simultaneously, which is why the table_extraction key of the config.yaml is a list. A table extraction algorithm fills in the assets under the key table_extractors, a list containing the assets of every algorithm you configured. Every algorithm provides the following assets (see the sketch after the list):

  • id: a unique identifier for this algorithm
  • type: the algorithm type; one of the algorithms listed below (camelot, unstructured, unstructured_api, llama_parse)
  • params: the named parameters and their values given to the construction of the algorithm
  • tables: the list of extracted tables as pandas dataframes
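
A sketch of how these assets can be traversed is given below; the dictionary content is made up, only the keys follow the description above:

import pandas as pd

assets = {
    "table_extractors": [
        {
            "id": "camelot-0",  # hypothetical identifier
            "type": "camelot",
            "params": {"flavor": "stream"},
            "tables": [pd.DataFrame({"Country": ["France"], "Tax paid": [30]})],
        }
    ]
}

for extractor in assets["table_extractors"]:
    print(extractor["id"], extractor["type"], extractor["params"])
    for table in extractor["tables"]:  # each extracted table is a pandas DataFrame
        print(table.shape)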

The following table extractors can be considered:

ExtractTable

ExtractTable is provided for legacy/benchmarking purposes. The ExtractTable Python module is no longer maintained, but it was originally the package used to extract data from PDF tables.

You can use it by specifying in the config.yaml:

table_extraction:
    - type: ExtractTableAPI

It requires an API key to be defined in your .env file:

# Required for table extraction with ExtractTable API
EXTRACTABLE_API_KEY=CHANGE_ME

Camelot

Camelot is a Python library for extracting tables. The documentation is available at https://camelot-py.readthedocs.io/en/master/.

Two flavors can be used: stream or lattice. The flavor can be specified in the config as:

table_extraction:
  - type: Camelot
    params:
      flavor: stream
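
For reference, a rough sketch of the underlying Camelot call is shown below; the project wraps this in its own extractor class, so take it as an illustration only:

import camelot

# Parse tables from a couple of pages with the "stream" flavor
tables = camelot.read_pdf("report.pdf", flavor="stream", pages="12,13")
for table in tables:
    print(table.df.head())  # each detected table exposes a pandas DataFrame via .df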

Unstructured API

The unstructured API is documented at https://unstructured-io.github.io/unstructured/apis/api_sdks.html. In the config.yaml, you can specify any of the parameters accepted by shared.PartitionParameters, although we already set strategy="hi_res" and pdf_infer_table_structure="True".

For example, you can use their beta model chipper by setting, in your config.yaml:

table_extraction:
    - type: UnstructuredAPI
      params:
        hi_res_model_name: chipper 

This API requires an API key. You can create one at https://unstructured.io/api-key-free. Once you have your key, you must copy the sample .env.sample to .env:

cp .env.sample .env

and then paste your key into:

UNSTRUCTURED_API_KEY=CHANGE_ME

Unstructured

In addition to using the unstructured API, you can also run unstructured locally. The parameters specified in your config.yaml are passed to the partition_pdf function, although we already set strategy="hi_res" and infer_table_structure=True.

For example, you can set the pdf_image_dpi as well as the table detection model:

table_extraction:
    - type: Unstructured
      params:
        pdf_image_dpi: 300
        hi_res_model_name: "yolox"
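
A rough sketch of the underlying partition_pdf call, combining the parameters above with the defaults mentioned earlier, could look like this (illustration only, not the project's wrapper):

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="report.pdf",
    strategy="hi_res",
    infer_table_structure=True,
    hi_res_model_name="yolox",
    pdf_image_dpi=300,
)
tables = [element for element in elements if element.category == "Table"]
for table in tables:
    print(table.metadata.text_as_html)  # HTML rendering of the detected table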

Llama parse API

Llama parse requires an API key. To create a key, go to http://cloud.llamaindex.ai. This key must be specified in the .env file (a sample file is provided as .env.example):

# Required for table extraction with LLAMA PARSE API
LLAMA_CLOUD_API_KEY=CHANGE_ME

You can then use llama parse in your configuration as below. The parameters are forwarded to the constructor of LlamaParse. For example, you can customize the verbosity:

table_extraction:
    - type: LlamaParse
      params:
        verbosity: False
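
For reference, a minimal direct use of the LlamaParse client could look like the sketch below; this is an illustration only, the pipeline builds the client for you and LLAMA_CLOUD_API_KEY is read from the environment:

from llama_parse import LlamaParse

parser = LlamaParse(result_type="markdown", verbose=False)
documents = parser.load_data("report.pdf")
print(documents[0].text[:500])  # first characters of the parsed report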

Table cleaning

Table cleaning is the last step of the pipeline: it takes the parsed tables as input and extracts the relevant information. You can specify multiple table cleaners, which is why table_cleaning is a list in the config.yaml. The tables extracted by every table extractor will be processed by every table cleaner.

The table cleaners append their assets to the list under the table_cleaners key. As with the table extractors, each table cleaner fills in the following assets:

  • id: a unique identifier for the table cleaner execution
  • type: the type of table cleaner
  • params: the parameters given for the construction of the cleaner
  • table: the output dataframe with the expected data per country

The list of available cleaners is given below:

LangChain / LangSmith

The LangChain module can be used by specifying in the config.yaml:

table_cleaning:
    - type: LLM
      params: 
        openai_model: "gpt-4-turbo-preview"

For now, we only support OpenAI models, but we may later also consider local models. For OpenAI models, you need an API key (see the OpenAI website), which must be provided in your .env file:

OPENAI_API_KEY=CHANGE_ME
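
To give an idea of what an LLM-based cleaner does, the sketch below serializes a parsed table and asks an OpenAI chat model (through LangChain) to re-order it per country. The prompt wording, the table content and the output format are assumptions, not the project's actual chain:

import pandas as pd
from langchain_openai import ChatOpenAI

# A made-up parsed table standing in for a table extractor's output
table = pd.DataFrame(
    {"Jurisdiction": ["France", "Ireland"], "Profit before tax": [120, 450], "Tax paid": [30, 5]}
)

llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)  # reads OPENAI_API_KEY from the environment
prompt = (
    "Extract the country-by-country figures from the following table and "
    "return one line per country as 'country;profit_before_tax;tax_paid':\n\n"
    + table.to_csv(index=False)
)
print(llm.invoke(prompt).content)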

With LangChain, you can also trace the LLM requests using LangSmith. Although optional, this can be useful to keep an eye on the expenses for paid language models and to debug the contexts/questions/answers. LangSmith requires an API key, created by logging in at https://smith.langchain.com, and a project name, both provided in your .env file as:

LANGCHAIN_API_KEY=CHANGE_ME
LANGCHAIN_TRACING_V2=true
LANGCHAIN_PROJECT="country-by-country"

Contributing

Use a venv

python3 -m venv name-of-your-venv

source name-of-your-venv/bin/activate

Use Poetry

Install Poetry:

python3 -m pip install "poetry==1.4.0"

Install the dependencies:

poetry install

Add a dependency:

poetry add pandas

Update the dependencies:

poetry update

Use Jupyter Notebook

jupyter notebook

and check your browser!

Run the pre-commit hooks locally

Install the pre-commit hooks, then run them on all files:

pre-commit run --all-files

Use Tox to test your code

tox -vv

Notebooks

Pipeline demonstrator

Open In Colab

Detecting pages containing a CbCR table

Decision tree and random forest

The country_by_country/pagefilter/RFClassifier filter uses a decision tree or random forests trained with the notebook below.

Open In Colab

Table detection

Two models seem conclusive, but they do not produce the same results.

Table detection with Yolox

Open In Colab

Table detection with Microsoft Table Transformer

Open In Colab

RAG

Llama parse + Llama index

Open In Colab