wikipedia-webscraping

Wikipedia company description harvesting

Investigate further whether the data in the side panel (infobox) of a company's Wikipedia page can be extracted
Get a list of companies <different contact - [email protected]>

Kasia to reach out to the DataQuality team to request a list of approx. 1000 company names without a description ([email protected])

Ask for a fair distribution of company sizes among the companies for which we don’t have a description
Ask to get these names together with their industry and revenue (rev) bands
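
For the side-panel investigation above, here is a minimal sketch of pulling label/value pairs out of a page's infobox table using only the standard library. It assumes the page HTML has already been downloaded and that the side panel is a table whose class list includes "infobox". The names InfoboxParser and parse_infobox are hypothetical and are not taken from side-panel.py.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect label/value pairs from the first table with class "infobox"."""

    def __init__(self):
        super().__init__()
        self.fields = {}     # label text -> value text
        self._depth = 0      # table nesting depth inside the infobox (0 = outside)
        self._cell = None    # "th" or "td" while inside a top-level infobox cell
        self._label = []
        self._value = []
        self._done = False   # set once the first infobox has been closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self._depth:
                self._depth += 1          # nested table inside the infobox
            elif "infobox" in classes:
                self._depth = 1           # entering the infobox itself
        elif self._depth == 1:
            if tag == "tr":
                self._label, self._value = [], []
            elif tag in ("th", "td"):
                self._cell = tag

    def handle_endtag(self, tag):
        if not self._depth:
            return
        if tag == "table":
            self._depth -= 1
            if self._depth == 0:
                self._done = True
        elif self._depth == 1:
            if tag in ("th", "td"):
                self._cell = None
            elif tag == "tr":
                # Collapse runs of whitespace so multi-line cells become one line.
                label = " ".join("".join(self._label).split())
                value = " ".join("".join(self._value).split())
                if label and value:
                    self.fields[label] = value

    def handle_data(self, data):
        if not self._depth:
            return
        if self._cell == "th":
            self._label.append(data)
        elif self._cell == "td":
            self._value.append(data)


def parse_infobox(html):
    """Return a dict of infobox fields, or an empty dict if no infobox is found."""
    parser = InfoboxParser()
    parser.feed(html)
    return parser.fields
```

Rows that have only a header or only a value (for example the logo row) are skipped, so the result contains just the labelled facts such as industry or headquarters.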

Company name cleaning (https://pypi.org/project/cleanco/)
Figure out how to best use Wikipedia “categories” to narrow search results down to companies only (and filter out unrelated pages)
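
The two cleaning steps above could be sketched roughly as follows. This is a standard-library stand-in, not the cleanco API: the cleanco package linked above covers far more legal suffixes and countries. The suffix list, the keyword list, and the function names clean_company_name and looks_like_company are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately short suffix list; cleanco ships a much larger one.
LEGAL_SUFFIXES = (
    "incorporated", "inc", "corporation", "corp", "limited", "ltd",
    "llc", "plc", "gmbh", "ag", "co",
)

# Matches one trailing legal suffix, optionally followed by a dot,
# preceded by whitespace and/or a comma.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(LEGAL_SUFFIXES) + r")\.?$", re.IGNORECASE
)

# Hypothetical keywords: a page counts as a company if any of its
# category names contains one of these words.
COMPANY_KEYWORDS = ("companies", "company")


def clean_company_name(name):
    """Strip trailing legal suffixes such as 'Inc.' or 'Co., Ltd.' from a name."""
    cleaned = " ".join(name.split())  # normalise whitespace first
    while True:
        stripped = _SUFFIX_RE.sub("", cleaned)
        if stripped == cleaned:
            return cleaned
        cleaned = stripped  # repeat to handle stacked suffixes


def looks_like_company(categories):
    """Return True if any category name suggests the page is about a company."""
    return any(
        keyword in category.lower()
        for category in categories
        for keyword in COMPANY_KEYWORDS
    )
```

The category check is meant to reject a page such as the fruit "Apple" when the search was for the company, assuming the category names have already been fetched for the candidate page.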

Harvesting of cleaned company names from Wikipedia

Set up a timer that records how many ms it takes to capture the info for each company
Set up a report that shows description harvesting penetration

Report should show, for each combination of rev band x industry, the raw number of company names given by DQ and the number of those companies for which we found a description on Wikipedia
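
The timer and the penetration report described above could be combined along these lines. This is only a sketch: the dictionary keys name, industry and rev_band are assumed column names for the DQ list, and fetch_description stands for whatever harvesting function is used (it is passed in, so nothing here touches the network).

```python
import time
from collections import defaultdict


def timed_ms(func, *args):
    """Call func(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def penetration_report(companies, fetch_description):
    """Count names given vs. descriptions found per (rev band, industry).

    companies: iterable of dicts with the assumed keys
               'name', 'industry' and 'rev_band'.
    fetch_description: callable taking a company name and returning a
                       description string, or None/"" when nothing is found.
    """
    report = defaultdict(lambda: {"given": 0, "found": 0, "total_ms": 0.0})
    for company in companies:
        cell = report[(company["rev_band"], company["industry"])]
        description, elapsed_ms = timed_ms(fetch_description, company["name"])
        cell["given"] += 1
        cell["total_ms"] += elapsed_ms
        if description:
            cell["found"] += 1
    return dict(report)
```

Penetration for a cell is then found divided by given, and total_ms divided by given is the average harvesting time per company in that cell.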

Repository files

README.md
app.py
companies.csv
helix.py
helix_no_description.csv
requirements.txt
side-panel.py
test.py
tests-cleaning.py
wikipedia-api.py