This repository contains code for several one-off web scraping projects that I have done. Most of them are in Jupyter notebooks because the objective was to extract the data once and then deliver it to the client.
The restore-eco folder contains the code to scrape organizations from the Restor-Eco website.
The code is written in Python and uses Selenium and BeautifulSoup to scrape the organizations' data. The chromedriver should be in the path outlined in constants.py, so this path should be updated accordingly.
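As a minimal sketch of how the chromedriver path could be checked before starting a long run: the constant name `CHROMEDRIVER_PATH`, its value, and the helper `check_chromedriver` are hypothetical and are not taken from constants.py.

```python
from pathlib import Path

# Hypothetical constant; the real name and value live in constants.py.
CHROMEDRIVER_PATH = Path("/usr/local/bin/chromedriver")


def check_chromedriver(path: Path) -> Path:
    """Return the path if it points to an existing file, otherwise raise."""
    if not path.is_file():
        raise FileNotFoundError(
            f"chromedriver not found at {path}; update the path in constants.py"
        )
    return path


if __name__ == "__main__":
    try:
        print(f"Using chromedriver at {check_chromedriver(CHROMEDRIVER_PATH)}")
    except FileNotFoundError as err:
        print(err)
```

Failing early like this avoids discovering a wrong path only after the scraper has been launched.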
The code takes a couple of hours to run, and the output is saved in the eco-restor-orgs.csv file.
The /india folder contains code that takes a rough approach to obtaining data on the schools displayed in the following site.
The folder named format_data contains a notebook that concatenates all the extracted data and formats it as desired.
The folder named extraction contains several notebooks used to scrape the page through its API. The approach is pretty rough because we iterate the schoolID parameter from 1 to 1,495,000. For that reason, 8 copies of the code were created, each covering part of the range, to extract the data as fast as possible. The bulk_webscraping notebook shows another possible approach that was not fully pursued.
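One way to divide the schoolID range among the 8 copies is sketched below. The helper `split_id_range` is hypothetical and is not part of the notebooks; it only illustrates how contiguous, non-overlapping ID ranges could be assigned to each copy.

```python
def split_id_range(first_id: int, last_id: int, n_chunks: int) -> list[range]:
    """Split the inclusive range [first_id, last_id] into contiguous chunks."""
    total = last_id - first_id + 1
    base, extra = divmod(total, n_chunks)
    chunks = []
    start = first_id
    for i in range(n_chunks):
        # The first `extra` chunks take one additional ID when the split is uneven.
        size = base + (1 if i < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


chunks = split_id_range(1, 1_495_000, 8)
for i, chunk in enumerate(chunks, start=1):
    print(f"copy {i}: schoolID {chunk.start} to {chunk.stop - 1}")
```

Each notebook copy would then loop only over its own range, so no schoolID is requested twice.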
The /tanzania folder contains code that extracts data on Tanzanian schools from this website.
The notebook contains the functional code, while the .py file contains a class that was not fully implemented.
The /ine folder contains code that was used to extract data from the 2022 Mexican presidential poll election. The website is no longer available.
The extract_casillas notebook calls the website's API to extract information per municipality, while format_data manipulates the resulting dataframes to obtain a cleaner version.
The code extracted the characteristics of all of the schools displayed in the following page, and then performed data cleaning to dump the data into a dataframe.
The /colombia_transparency folder contains a notebook that extracts all of the Excel files from the following page using Selenium and BeautifulSoup. The code receives the name of a municipality, extracts its historical income data and saves it into a folder.
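The link-collection step can be illustrated with a short sketch. To keep it dependency-free, it uses the standard library's `html.parser` in place of BeautifulSoup, and it works on a static HTML string rather than a Selenium session. The class `ExcelLinkParser`, the function `find_excel_links`, the sample HTML and the example.org URL are all hypothetical and not taken from the notebook.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ExcelLinkParser(HTMLParser):
    """Collect absolute URLs of <a> links that point to .xls or .xlsx files."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith((".xls", ".xlsx")):
            self.links.append(urljoin(self.base_url, href))


def find_excel_links(html: str, base_url: str) -> list[str]:
    parser = ExcelLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Made-up page fragment, used only for illustration.
sample_html = (
    '<a href="/files/income_2019.xlsx">2019</a>'
    '<a href="/about">About</a>'
    '<a href="data/income_2020.xls">2020</a>'
)
print(find_excel_links(sample_html, "https://example.org/municipality/"))
```

In a Selenium workflow, the HTML string would typically come from the driver's `page_source` once the municipality page has loaded.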