## Table of Contents

- Introduction
- Features
- Prerequisites
- Installation
- Usage
- Script Breakdown
- Technical Details
- Future Improvements
- Contributing
- License
- Contact
## Introduction

This project automates the extraction and organization of public procurement data from the Spanish government website contrataciondelestado.es. By orchestrating a series of Python scripts, it navigates through web pages, scrapes relevant data, and converts it into a structured CSV format for easy analysis.
## Features

- Automated Deep Link Generation: Navigate and extract specific URLs based on search criteria.
- Data Extraction with Scrapy: Crawl tender pages to extract detailed procurement information.
- JSON to CSV Conversion: Transform the scraped JSON data into CSV format.
- Future-Proof Design: Built with scalability and future enhancements in mind.
## Prerequisites

- Python 3.6 or higher
- Libraries (a sample `requirements.txt` follows this list):
  - `selenium`
  - `beautifulsoup4`
  - `scrapy`
  - `pandas`
- WebDriver for Selenium (e.g., ChromeDriver)
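A minimal `requirements.txt` covering this list would look as follows (unpinned; the project's own file may pin specific versions):

```text
selenium
beautifulsoup4
scrapy
pandas
```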
## Installation

1. **Clone the Repository**

   ```bash
   git clone https://github.com/yourusername/your-repo-name.git
   cd your-repo-name
   ```
2. **Create a Virtual Environment (Optional but Recommended)**

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```
3. **Install Dependencies**

   ```bash
   pip install -r requirements.txt
   ```
4. **Set Up WebDriver**

   - Download the appropriate WebDriver for your browser (e.g., ChromeDriver).
   - Ensure it is added to your system PATH, or specify its path in `1_DeepLinks.py` (see the sketch below).
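If the driver binary is not on your PATH, Selenium can be pointed at it explicitly. A minimal sketch, assuming Selenium 4 and Chrome; the path below is a placeholder, not the project's actual configuration:

```python
# Illustrative only: wiring up an explicit ChromeDriver path.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

DRIVER_PATH = "/path/to/chromedriver"  # hypothetical location on your machine

service = Service(executable_path=DRIVER_PATH)
driver = webdriver.Chrome(service=service)
```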
## Usage

Run the main script to execute the entire workflow:

```bash
python 0_Main.py
```

This script will sequentially execute:

1. `1_DeepLinks.py` to generate deep links.
2. `2_ProjectFinder.py` to scrape procurement data.
3. `3_json2csv.py` to convert the JSON data into CSV format.
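A minimal sketch of how `0_Main.py` might chain these stages with `subprocess` (an assumed structure, not necessarily the project's actual implementation):

```python
# Minimal sketch of a sequential pipeline runner (illustrative only).
import subprocess
import sys

SCRIPTS = ["1_DeepLinks.py", "2_ProjectFinder.py", "3_json2csv.py"]

def main():
    for script in SCRIPTS:
        print(f"Running {script} ...")
        result = subprocess.run([sys.executable, script])
        # Stop the pipeline if any stage fails.
        if result.returncode != 0:
            sys.exit(f"{script} failed with exit code {result.returncode}")

if __name__ == "__main__":
    main()
```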
## Script Breakdown

### 1_DeepLinks.py

This script scrapes specific URLs from the Spanish public procurement website.

- Technologies Used: Selenium, BeautifulSoup
- Purpose: Automate the navigation and extraction of deep-link URLs based on predefined search criteria.
- Focus: Extract data related to "Works" with a specific CPV code (41000000).
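As a hedged illustration of this Selenium-plus-BeautifulSoup pattern (the form-filling step and link filter below are assumptions, not code from `1_DeepLinks.py`):

```python
# Illustrative pattern: render the page with Selenium, parse it with BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup

BASE_URL = "https://contrataciondelestado.es"  # real site; the exact search path is project-specific

driver = webdriver.Chrome()
driver.get(BASE_URL)
# The real script would fill in the search form here:
# contract type "Works", CPV code 41000000, then submit.

soup = BeautifulSoup(driver.page_source, "html.parser")
# Hypothetical filter: keep anchors that look like tender deep links.
deep_links = [a["href"] for a in soup.find_all("a", href=True)
              if "deeplink" in a["href"].lower()]

driver.quit()
print(f"Found {len(deep_links)} candidate deep links")
```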
### 2_ProjectFinder.py

A Scrapy-based crawler designed to extract procurement details.
- Technologies Used: Scrapy
- Purpose: Scrape tender pages to extract information such as:
  - Contracting Authority
  - Tender Status
  - Contract Object
  - Base Budget
  - Contract Type
  - CPV Code
  - Location
  - Key Dates
  - Tender URL
- Features:
  - Custom User Agent and UTF-8 encoding for proper handling of Spanish text.
  - Designed as a `CrawlSpider` for scalability.
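A sketch of what such a `CrawlSpider` could look like; the spider name, crawl rule, and CSS selectors are placeholders, not the project's actual code:

```python
# Illustrative CrawlSpider; selectors and rules are placeholders.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProjectFinderSpider(CrawlSpider):
    name = "projectfinder"
    allowed_domains = ["contrataciondelestado.es"]
    start_urls = []  # would be populated with the deep links from 1_DeepLinks.py

    custom_settings = {
        # Custom user agent and UTF-8 feed encoding, as described above.
        "USER_AGENT": "Mozilla/5.0 (compatible; ProcurementResearchBot/1.0)",
        "FEED_EXPORT_ENCODING": "utf-8",
    }

    rules = (
        # Hypothetical URL pattern for tender detail pages.
        Rule(LinkExtractor(allow=r"deeplink"), callback="parse_tender"),
    )

    def parse_tender(self, response):
        # The CSS selectors below stand in for the real page structure.
        yield {
            "contracting_authority": response.css("#authority::text").get(),
            "tender_status": response.css("#status::text").get(),
            "base_budget": response.css("#budget::text").get(),
            "cpv_code": response.css("#cpv::text").get(),
            "tender_url": response.url,
        }
```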
### 3_json2csv.py

Converts the JSON output from Scrapy into CSV format.

- Technologies Used: pandas
- Purpose: Flatten nested JSON structures and handle lists within JSON by converting them to strings.
- Output: `3_ProjectFinder.csv`, containing structured data ready for analysis.
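A minimal sketch of that conversion; the input file name is an assumption, while the output name matches the pipeline:

```python
# Illustrative JSON-to-CSV conversion with pandas.
import json
import pandas as pd

# Hypothetical name for the Scrapy JSON feed output.
with open("2_ProjectFinder.json", encoding="utf-8") as f:
    records = json.load(f)

# Flatten nested dicts into dotted columns (e.g. "location.city").
df = pd.json_normalize(records)

# Lists don't fit in a single CSV cell, so join them into strings.
for col in df.columns:
    df[col] = df[col].apply(
        lambda v: ", ".join(map(str, v)) if isinstance(v, list) else v
    )

df.to_csv("3_ProjectFinder.csv", index=False, encoding="utf-8")
```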
## Future Improvements

- Enhanced Input: Provide the crawler with the list of URLs generated by `1_DeepLinks.py` for expanded coverage.
- PDF Downloading: Implement functionality to download all PDF documents linked from the crawled pages.
- Data Export Options: Improve export options for better data analysis, possibly integrating databases.
- Robust Scraping Logic: Enhance the crawler to handle inconsistencies across different tender pages.
## Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
## License

This project is licensed under the GNU General Public License (GPL).
## Contact

- Author: Isidre Canyelles
- Email: [email protected]

Feel free to reach out with any questions or collaboration opportunities.
README.md file generated with ChatGPT