- Investigate further whether the side-panel (infobox) data can be extracted from a company's Wikipedia page
- Get a list of companies <different contact - [email protected]>
Kasia to reach out to the DataQuality team and ask for a list of approx. 1,000 company names without a description ([email protected])
- Ask for a fair distribution of company sizes among the companies for which we don’t have a description
- Ask for these names to be delivered together with their industry and rev bands
- Company name cleaning (https://pypi.org/project/cleanco/)
- Figure out how best to use Wikipedia “categories” to narrow results down to company pages only (and exclude unrelated pages)
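A rough sketch of how the “categories” item could work, not a settled approach: ask the MediaWiki API for a page's categories and keep the page only if a category title contains a company-like word. The hint words in `COMPANY_HINTS` and the helper names are assumptions made for illustration; the real keyword list would need tuning against actual pages.

```python
from typing import List
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

# Assumption: these hint words are illustrative only and need tuning.
COMPANY_HINTS = ("companies", "corporations", "manufacturers")


def categories_query_url(title: str) -> str:
    """Build a MediaWiki API URL that lists the categories of one page."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "cllimit": "max",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def category_titles(response: dict) -> List[str]:
    """Collect category titles from a parsed JSON response of that query."""
    pages = response.get("query", {}).get("pages", {})
    titles = []
    for page in pages.values():
        for category in page.get("categories", []):
            titles.append(category["title"])
    return titles


def looks_like_company(response: dict) -> bool:
    """True if any category title contains one of the hint words."""
    return any(
        hint in title.lower()
        for title in category_titles(response)
        for hint in COMPANY_HINTS
    )
```

The response would come from fetching `categories_query_url(name)` and parsing the JSON; pages that fail `looks_like_company` (rivers, people, bands with the same name) would be skipped before any description is harvested.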
Harvesting of cleaned company names from Wikipedia
- Set up a timer that records how many ms it takes to capture the info for each company
- Set up a report that shows description-harvesting penetration
- The report should show, for each combination of rev band x industry, the raw number of company names given by DQ and the number of those companies for which we found a description on Wikipedia
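A minimal sketch of the timer and the penetration report together, under stated assumptions: each company arrives as a dict with `name`, `industry` and `rev_band` keys (hypothetical field names, since the DQ format is not known yet), and `fetch_description` is a placeholder for whatever harvesting function we end up with, returning a description or `None`.

```python
import time
from collections import Counter


def harvest_with_timing(companies, fetch_description):
    """Run the harvester per company and record elapsed time in ms.

    fetch_description is a hypothetical callable: name -> str or None.
    """
    results = []
    for company in companies:
        start = time.perf_counter()
        description = fetch_description(company["name"])
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        row = dict(company)
        row["description"] = description
        row["elapsed_ms"] = elapsed_ms
        results.append(row)
    return results


def penetration_report(results):
    """Count names given vs. descriptions found per (rev band, industry)."""
    given = Counter()
    found = Counter()
    for row in results:
        key = (row["rev_band"], row["industry"])
        given[key] += 1
        if row["description"]:
            found[key] += 1
    return {
        key: {
            "given": given[key],
            "found": found[key],
            "penetration": found[key] / given[key],
        }
        for key in given
    }
```

Each report row holds the raw count of names given by DQ, the count with a description found, and the ratio of the two; the per-company `elapsed_ms` values stay in the results list so they can be averaged or inspected separately.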