The Pravda News Extractor is a Python script that fetches news data from a specified Pravda domain, extracts key details using Beautiful Soup, and saves the collected data as JSON. The script handles different HTML content structures and keeps collecting news items until no more are found.
- Fetch initial news items from a specified domain's API.
- Extract news items including ID, image URL, link, title, category, and timestamp.
- Continue fetching additional news items based on the last ID until all are collected.
- Handle different HTML content structures for image sources.
- Save extracted news items in a JSON file with a naming convention based on the domain and current date.
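The "continue fetching based on the last ID" step can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `fetch_page` is a hypothetical callable standing in for the real API request (the endpoint and response shape are not documented here), and each item is assumed to be a dict with an `"id"` key.

```python
from typing import Callable, Dict, List, Optional

NewsItem = Dict[str, str]


def collect_all(fetch_page: Callable[[Optional[str]], List[NewsItem]]) -> List[NewsItem]:
    """Call fetch_page repeatedly, passing the last seen ID, until nothing new comes back.

    fetch_page is a hypothetical stand-in for the real HTTP request.
    """
    items: List[NewsItem] = []
    seen_ids = set()
    last_id: Optional[str] = None  # None means "request the initial batch"

    while True:
        new_items = []
        for item in fetch_page(last_id):
            # Skip items already collected so a repeating API cannot loop forever.
            if item["id"] not in seen_ids:
                seen_ids.add(item["id"])
                new_items.append(item)
        if not new_items:
            break
        items.extend(new_items)
        last_id = new_items[-1]["id"]

    return items


# Fake data in place of the network, keyed by the last ID passed in (assumed shape).
FAKE_PAGES = {None: [{"id": "3"}, {"id": "2"}], "2": [{"id": "1"}], "1": []}


def fake_fetch(last_id: Optional[str]) -> List[NewsItem]:
    return FAKE_PAGES.get(last_id, [])


print([item["id"] for item in collect_all(fake_fetch)])  # prints ['3', '2', '1']
```

Passing the fetch function in as a parameter keeps the loop testable without network access; in the real script the request would presumably be made with Requests.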
- Ensure Python 3 is installed on your system.
- Install the Beautiful Soup 4 and Requests libraries:
pip install beautifulsoup4 requests
- Download pravda-extract.py from this repository.
Run the script with the domain as an argument:
python3 pravda-extract.py [DOMAIN]
For example:
python3 pravda-extract.py pravda-fi.com
This command fetches news from 'pravda-fi.com' and saves it in a JSON file named pravda-fi.com_DD-MM-YY.json.
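To make the naming convention concrete, here is a small sketch of how such a filename could be built and the items written out. The function names `build_output_filename` and `save_news` are hypothetical, and the `indent` and `ensure_ascii` settings are assumptions; the actual script may do this differently.

```python
import json
from datetime import date
from pathlib import Path
from typing import Dict, List, Optional


def build_output_filename(domain: str, today: Optional[date] = None) -> str:
    """Return '<domain>_DD-MM-YY.json' for the given (or current) date."""
    today = today or date.today()
    return f"{domain}_{today.strftime('%d-%m-%y')}.json"


def save_news(items: List[Dict[str, str]], domain: str,
              directory: str = ".", today: Optional[date] = None) -> Path:
    """Write the items as JSON and return the path of the file created."""
    path = Path(directory) / build_output_filename(domain, today)
    with path.open("w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII titles readable (an assumed preference).
        json.dump(items, fh, ensure_ascii=False, indent=2)
    return path


print(build_output_filename("pravda-fi.com", date(2024, 3, 5)))
# prints pravda-fi.com_05-03-24.json
```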
- Python 3
- Beautiful Soup 4
- Requests
Contributions, issues, and feature requests are welcome. Feel free to check the issues page if you want to contribute.
This project is licensed under the MIT License - see the LICENSE file for details.