This project is a Python script that automates the extraction of data from an HTML table on a specified website. The script uses Selenium for web automation, BeautifulSoup for parsing HTML, and pandas for data manipulation.
- Logs into a specified website using credentials from a YAML configuration file.
- Navigates through paginated results.
- Extracts data from an HTML table and stores it in a pandas DataFrame.
- Displays the head of the DataFrame on the first page and the current page number on subsequent pages.
- The `munge.ipynb` file processes the extracted CSV data to generate reports of files that exceed a specified similarity threshold or that failed to process on the online website.
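The table-extraction step above can be sketched with BeautifulSoup and pandas. The helper name and the sample HTML below are illustrative assumptions; the actual `grabTable.py` may locate the table differently:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

def table_to_frame(page_source):
    """Parse the first HTML table on a page into a DataFrame.

    Hypothetical helper for illustration; returns None when the
    page contains no table.
    """
    soup = BeautifulSoup(page_source, "lxml")
    table = soup.find("table")
    if table is None:
        return None
    # read_html parses the <th> row as the column header
    return pd.read_html(StringIO(str(table)))[0]

# Sample page source standing in for driver.page_source
html = """
<html><body><table>
  <tr><th>file</th><th>similarity</th></tr>
  <tr><td>a.py</td><td>0.91</td></tr>
  <tr><td>b.py</td><td>0.12</td></tr>
</table></body></html>
"""
df = table_to_frame(html)
print(df.head())
```

In the real script, each page's DataFrame would be collected while Selenium clicks through the pagination, then concatenated with `pd.concat` before writing the CSV.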
- Python 3.x
- Selenium
- pandas
- BeautifulSoup4
- lxml
- Clone the repository.
- Install the required packages:

  ```shell
  pip install selenium pandas beautifulsoup4 lxml
  ```

- Ensure you have the appropriate web driver (e.g., geckodriver for Firefox) installed and included in your system's PATH.
- Edit the `credentials.yml` file and ensure that `grabTable.py` reads it.
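A minimal `credentials.yml` might look like the following; the key names here are assumptions for illustration, so match them to whatever `grabTable.py` actually loads (typically via `yaml.safe_load`):

```yaml
# Illustrative layout only; the actual key names depend on grabTable.py.
username: your-username
password: your-password
url: https://example.com/login
```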
After extracting the data into a CSV file, you can use the `munge.ipynb` Jupyter notebook to process the data. This notebook generates reports for files that exceed a specified similarity threshold or that failed to process on the online website.
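The notebook's filtering step can be sketched with plain pandas. The column names, the `"failed"` status value, and the threshold below are assumptions for illustration, not the notebook's actual schema:

```python
import pandas as pd

THRESHOLD = 0.8  # hypothetical similarity cutoff

# Stand-in for the extracted CSV, e.g. pd.read_csv("results.csv")
df = pd.DataFrame({
    "file": ["a.py", "b.py", "c.py"],
    "similarity": [0.91, 0.12, None],
    "status": ["ok", "ok", "failed"],
})

# Files whose similarity exceeds the threshold
over_threshold = df[df["similarity"] > THRESHOLD]
# Files that failed to process on the online website
failed = df[df["status"] == "failed"]

print(over_threshold["file"].tolist())
print(failed["file"].tolist())
```

Rows with a missing similarity score compare as False against the threshold, so failed files naturally fall out of the first report and show up only in the second.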