kasiahewelt/wikipedia-webscraping

wikipedia-webscraping

Wikipedia company description harvesting

  1. Investigate further whether the data in the side panel (infobox) of a company's Wikipedia page can be extracted

  2. Get a list of companies <different contact - [email protected]>

Kasia to reach out to the DataQuality team and ask for a list of approx. 1000 company names that have no description ([email protected])

  • Ask for a fair distribution of company sizes among the companies for which we don’t have a description

  • Ask for these names together with their industry and revenue bands
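For the side-panel investigation in step 1 above, one possible starting point is to parse the infobox out of a page's raw wikitext. The sketch below is a minimal illustration, not a settled approach: it assumes the wikitext has already been fetched (fetching is not shown), that each infobox field sits on its own `| key = value` line, and that multi-line values can be ignored. The function names and the sample page are hypothetical.

```python
import re


def extract_infobox(wikitext):
    """Return the text of the first {{Infobox ...}} template, or None."""
    start = wikitext.lower().find("{{infobox")
    if start == -1:
        return None
    depth, i = 0, start
    # Count {{ and }} so that nested templates do not end the infobox early.
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[start:i]
        else:
            i += 1
    return None  # braces never balanced


def parse_infobox(wikitext):
    """Map infobox field names to raw values (single-line fields only)."""
    box = extract_infobox(wikitext)
    if box is None:
        return {}
    fields = {}
    for line in box.splitlines():
        match = re.match(r"^\s*\|\s*([^=|]+?)\s*=\s*(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


# Made-up sample wikitext for illustration only.
SAMPLE = """{{Infobox company
| name = Example Corp
| industry = [[Software]]
| founded = {{Start date|1999}}
| num_employees = 1,200
}}
'''Example Corp''' is a fictional company."""

infobox = parse_infobox(SAMPLE)
print(infobox["name"], "|", infobox["industry"])
```

Values come back as raw wikitext (links and nested templates are not resolved), so a later cleanup pass would still be needed.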

  1. Company name cleaning (https://pypi.org/project/cleanco/)

  2. Figure out how best to use Wikipedia “categories” to narrow results down to companies only (and not unrelated pages)
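The two steps above could be prototyped roughly as follows. This is a standard-library stand-in, not the cleanco API: the suffix list and the category keywords are short, hypothetical examples, and `clean_name` and `looks_like_company` are illustrative names. In practice cleanco's own term lists would cover far more legal forms.

```python
import re

# Hypothetical, deliberately short list of legal suffixes.
LEGAL_SUFFIXES = (
    "incorporated", "inc", "corporation", "corp", "company", "co",
    "limited", "ltd", "llc", "plc", "gmbh", "ag", "sa",
)
# A suffix must follow whitespace or a comma and end the string.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(LEGAL_SUFFIXES) + r")\.?$", re.IGNORECASE
)

# Hypothetical keywords expected in the category names of company pages.
COMPANY_HINTS = ("companies", "corporations")


def clean_name(name):
    """Collapse whitespace and strip trailing legal suffixes repeatedly."""
    cleaned = " ".join(name.split())
    while True:
        stripped = _SUFFIX_RE.sub("", cleaned)
        if stripped == cleaned:
            return cleaned
        cleaned = stripped


def looks_like_company(categories):
    """True if any category name contains one of the company keywords."""
    return any(
        hint in category.lower()
        for category in categories
        for hint in COMPANY_HINTS
    )


print(clean_name("Acme  Holdings, Inc."))
print(looks_like_company(["Category:Software companies of the United States"]))
```

The keyword check is only a first filter; pages about, say, lists of companies would also pass it and need separate handling.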

Harvesting of cleaned company names from Wikipedia

  1. Set up a timer that records how many milliseconds it takes to capture the info for each company

  2. Set up a report that shows description harvesting penetration

  • For each combination of revenue band x industry, the report should show the raw number of company names given by DQ and the number of those companies for which we found a description on Wikipedia
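The timer and the penetration report described above might be sketched like this. It is an illustration under assumptions: the record keys `name`, `industry`, and `rev_band` are hypothetical, `lookup` stands for whatever function eventually fetches a description (here replaced by a fake in-memory dictionary), and the sample companies are invented.

```python
import time
from collections import defaultdict


def harvest(companies, lookup):
    """Look up each company and record how many ms the lookup took."""
    results = []
    for company in companies:
        start = time.perf_counter()
        description = lookup(company["name"])
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results.append(
            {**company, "description": description, "elapsed_ms": elapsed_ms}
        )
    return results


def penetration_report(results):
    """Count names given and descriptions found per (rev_band, industry)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [given, found]
    for row in results:
        key = (row["rev_band"], row["industry"])
        counts[key][0] += 1
        if row["description"]:
            counts[key][1] += 1
    return {
        key: {"given": given, "found": found, "penetration": found / given}
        for key, (given, found) in sorted(counts.items())
    }


# Fake lookup table standing in for the real Wikipedia call.
FAKE_WIKI = {"Acme": "Acme is a fictional manufacturer."}
companies = [
    {"name": "Acme", "industry": "Manufacturing", "rev_band": "10-50M"},
    {"name": "Nonexistent Widgets", "industry": "Manufacturing", "rev_band": "10-50M"},
    {"name": "Globex", "industry": "Software", "rev_band": "<10M"},
]

results = harvest(companies, FAKE_WIKI.get)
report = penetration_report(results)
for key, stats in report.items():
    print(key, stats)
```

Passing the lookup in as a parameter keeps the timing and reporting logic testable without network access.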
  1. Matching (https://pypi.org/project/company-name-matching/)
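To get a feel for the matching step before adopting the package linked above, a rough stand-in can be built on the standard library's `difflib`. This does not use or imitate the company-name-matching API; the `best_match` helper and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def best_match(query, candidates, threshold=0.85):
    """Return the most similar candidate, or None if none reaches threshold."""
    scored = [
        (SequenceMatcher(None, query.casefold(), c.casefold()).ratio(), c)
        for c in candidates
    ]
    if not scored:
        return None
    best_score, best = max(scored)
    return best if best_score >= threshold else None


# Invented page titles for illustration only.
titles = ["Acme Holding", "Apex Holdings", "Globex"]
print(best_match("Acme Holdings", titles))
```

Character-level similarity can confuse genuinely different companies with similar names, so the threshold would need tuning against real DQ data.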
