Webscraping Sideprojects

This repository contains code for several one-off webscraping projects that I have done. Most of them are in Jupyter notebooks because the objective was to extract the data once and then deliver it to the client.

Restor-Eco Scraper

The restor-eco folder contains the code to scrape organizations from the Restor-Eco website.

The code is written in Python and uses Selenium and BeautifulSoup to scrape the organizations' data. The chromedriver must be located at the path defined in constants.py, so update that path to match your setup.

The code takes a couple of hours to run, and the output is saved in the eco-restor-orgs.csv file.

Indian Schools - BeautifulSoup

The /india folder contains code that takes a rough approach to obtaining data on the schools displayed in the following site.

The folder named format_data contains a notebook that concatenates all the extracted data and formats it as desired.

The folder named extraction contains several notebooks used to scrape the page through its API. The approach is pretty rough because we iterate the schoolID parameter from 1 to 1,495,000. For that reason, 8 copies of the code were created to extract the data as fast as possible. The bulk_webscraping notebook shows another possible approach that was not fully pursued.
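The split of the schoolID range across 8 copies can be sketched as follows. This is only an illustration under assumptions: the function name split_id_range is hypothetical and the notebooks may divide the range differently.

```python
def split_id_range(first_id, last_id, n_workers):
    """Split the inclusive range [first_id, last_id] into contiguous chunks.

    Hypothetical helper: one chunk per copy of the extraction notebook.
    """
    total = last_id - first_id + 1
    base, extra = divmod(total, n_workers)
    chunks = []
    start = first_id
    for i in range(n_workers):
        # The first `extra` chunks take one additional ID each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        chunks.append(range(start, start + size))
        start += size
    return chunks


# schoolID runs from 1 to 1,495,000 and there were 8 copies of the code.
chunks = split_id_range(1, 1_495_000, 8)
for number, chunk in enumerate(chunks, start=1):
    print(f"copy {number}: schoolID {chunk[0]} to {chunk[-1]}")
```

Each copy would then loop only over its own chunk, so no schoolID is requested twice.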

Tanzania Schools - BeautifulSoup

The /tanzania folder contains code that extracts data on Tanzanian schools from this website.

The notebook contains the functional code, while the .py file contains a class that was not fully implemented.

INE - API Extraction

The /ine folder contains code that was used to extract data from the 2022 Mexican presidential poll election. The website is no longer available.

The extract_casillas notebook calls the website's API to extract information per municipality, while the format_data notebook manipulates the resulting dataframes to obtain a cleaner version.

Colombia - BeautifulSoup

The /colombia folder contains code that extracts data on all of the schools displayed in the following page. The code extracts the characteristics of these schools, then cleans the data and loads it into a dataframe.

Colombia Transparency - Selenium Extraction

The /colombia_transparency folder contains a notebook that extracts all of the Excel files from the following page using Selenium and BeautifulSoup. The code receives the name of a municipality, extracts its historical income data, and saves it into a folder.
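The "save it into a folder" step could look roughly like the sketch below. It covers only the file handling, not the Selenium download, and every name in it (municipality_folder, save_excel, the folder naming scheme) is an assumption made for illustration rather than what the notebook does.

```python
import re
import unicodedata
from pathlib import Path


def municipality_folder(name, root):
    """Return (and create) a folder for one municipality under `root`.

    Hypothetical naming scheme: accents are dropped and any run of
    non-alphanumeric characters becomes a single underscore.
    """
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")
    folder = Path(root) / slug
    folder.mkdir(parents=True, exist_ok=True)
    return folder


def save_excel(content, filename, folder):
    """Write the raw bytes of one downloaded Excel file into `folder`."""
    target = Path(folder) / filename
    target.write_bytes(content)
    return target
```

Keeping one folder per municipality means a rerun for a single municipality overwrites only its own files.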
