Run me

This solution consists of three notebooks.

CSR download

Given a synthetic portfolio (stored as a JSON file in the config folder), we download raw CSR reports from responsibilityreports.com and store them on a volume specified in your config file. We use the Tika-OCR library from Databricks Labs to read and process unstructured documents.

Please ensure you have installed the library on your cluster as a Maven dependency.

We recommend installing the Tesseract binaries, since some text may be embedded in images. This can be done with an init script at cluster startup.
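As a rough illustration of the download step, the sketch below reads a portfolio and works out where each report would be stored on the volume. The portfolio layout (a list of objects with a `ticker` field), the function name `plan_downloads`, and the volume path are all assumptions made for this example; the actual notebook and config file may be organized differently.

```python
import json
from pathlib import PurePosixPath


def plan_downloads(portfolio_json: str, volume_path: str) -> dict[str, str]:
    """Map each ticker in the portfolio to a target PDF path on the volume.

    The portfolio schema used here is hypothetical.
    """
    portfolio = json.loads(portfolio_json)
    root = PurePosixPath(volume_path)
    return {
        company["ticker"]: str(root / f"{company['ticker']}.pdf")
        for company in portfolio
    }


# Hypothetical portfolio content; the real file lives in the config folder.
example_portfolio = json.dumps([
    {"ticker": "AAA", "name": "Example Corp"},
    {"ticker": "BBB", "name": "Sample Inc"},
])

# Hypothetical volume path; use the one from your own config file.
targets = plan_downloads(example_portfolio, "/Volumes/my_catalog/my_schema/csr")
for ticker, path in targets.items():
    print(ticker, "->", path)
```

Planning the target paths separately from the download itself makes it easy to check which reports are already present before fetching anything.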

CSR scoring

We extract dominant topics from CSR reports using a simple LDA model tuned with Hyperopt. We use the DBRX model to name each topic. Please ensure the Foundation Model APIs are available in your workspace.
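The topic-naming step can be pictured as turning the top keywords of each LDA topic into a short prompt for the model. The sketch below only builds that prompt; the wording, the function name `build_topic_prompt`, and the keyword list are assumptions for illustration, and the call to DBRX itself is left out because it depends on how your workspace exposes the model.

```python
def build_topic_prompt(top_words: list[str], max_words: int = 10) -> str:
    """Build a naming prompt from the highest-weighted words of one topic.

    The prompt wording is hypothetical, not the one used by the notebook.
    """
    keywords = ", ".join(top_words[:max_words])
    return (
        "The following keywords describe one topic found in corporate "
        "sustainability reports: " + keywords + ". "
        "Reply with a short name (at most three words) for this topic."
    )


# Hypothetical keywords for a single topic.
prompt = build_topic_prompt(["emissions", "carbon", "energy", "climate"])
print(prompt)
```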

GDELT download

We want to enrich our ESG scoring strategy with an alternative dataset provided by GDELT. While a preliminary version of this solution downloaded news events from the GDELT website, the same data is now available on the marketplace. Adding this dataset will create a dedicated catalog / schema that must be recorded in your configuration file.
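Since the GDELT catalog and schema have to be recorded in the configuration file, a small check like the one below can catch a missing entry before the notebook runs. The key names (`gdelt`, `catalog`, `schema`) and the example values are assumptions for this sketch; adapt them to the layout of your own config file.

```python
import json

# Hypothetical key names; the real configuration file may use different ones.
REQUIRED_GDELT_KEYS = ("catalog", "schema")


def gdelt_table_prefix(config: dict) -> str:
    """Return '<catalog>.<schema>' for GDELT, or raise if a setting is missing."""
    gdelt = config.get("gdelt", {})
    missing = [key for key in REQUIRED_GDELT_KEYS if not gdelt.get(key)]
    if missing:
        raise ValueError(f"Missing GDELT settings in config: {missing}")
    return f"{gdelt['catalog']}.{gdelt['schema']}"


# Hypothetical configuration content.
example_config = json.loads(
    '{"gdelt": {"catalog": "gdelt_catalog", "schema": "events"}}'
)
print(gdelt_table_prefix(example_config))
```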