CheckFirstHQ/Pravda-links-extractor

Pravda News Extractor

Description

The Pravda News Extractor is a Python script that fetches news data from a specified Pravda domain, extracts key details with Beautiful Soup, and saves the collected data as JSON. It handles different HTML content structures and keeps collecting news items until no more are found.
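
As a rough illustration of the extraction step, the sketch below parses a small HTML fragment with Beautiful Soup and falls back from src to data-src when looking for the image URL. The markup, the class names, and the extract_items function are assumptions made up for this example; the real pages and the real pravda-extract.py may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tags and class names here are assumptions,
# not the structure actually served by a Pravda domain.
SAMPLE_HTML = """
<div class="news-item" data-id="101">
  <a href="/world/101.html"><img src="/img/a.jpg"><span class="title">First</span></a>
  <span class="category">World</span><time datetime="2024-01-02T10:00:00"></time>
</div>
<div class="news-item" data-id="100">
  <a href="/sport/100.html"><img data-src="/img/b.jpg"><span class="title">Second</span></a>
  <span class="category">Sport</span><time datetime="2024-01-02T09:00:00"></time>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def extract_items(html):
    """Turn each hypothetical news-item block into a plain dictionary."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for node in soup.select("div.news-item"):
        img = node.find("img")
        link = node.find("a")
        time_tag = node.find("time")
        # Two image layouts: a plain src attribute, or a lazy-loaded data-src.
        image_url = None
        if img is not None:
            image_url = img.get("src") or img.get("data-src")
        items.append({
            "id": node.get("data-id"),
            "image": image_url,
            "link": link.get("href") if link is not None else None,
            "title": text_or_none(node.select_one(".title")),
            "category": text_or_none(node.select_one(".category")),
            "timestamp": time_tag.get("datetime") if time_tag is not None else None,
        })
    return items


if __name__ == "__main__":
    for item in extract_items(SAMPLE_HTML):
        print(item)
```

Missing elements become None rather than raising an error, so one malformed item does not stop the whole run.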

Features

  • Fetch the initial batch of news items from the specified domain's API.
  • Extract each item's ID, image URL, link, title, category, and timestamp.
  • Keep fetching additional items, starting from the last ID seen, until all are collected.
  • Handle different HTML content structures for image sources.
  • Save the extracted items to a JSON file named after the domain and the current date.
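
The "continue from the last ID" behaviour can be sketched as a simple loop. This is only an illustration under stated assumptions: collect_all, fake_fetch, and FAKE_FEED are hypothetical names, and the in-memory feed stands in for the domain's API, whose actual endpoint and parameters are not documented here.

```python
# In-memory stand-in for the remote feed, newest item first (assumed ordering).
FAKE_FEED = [{"id": n, "title": f"Item {n}"} for n in range(10, 0, -1)]


def fake_fetch(last_id, page_size=4):
    """Hypothetical fetcher: return up to page_size items older than last_id."""
    older = [item for item in FAKE_FEED if last_id is None or item["id"] < last_id]
    return older[:page_size]


def collect_all(fetch_page):
    """Call fetch_page repeatedly, passing the last ID seen, until nothing new arrives."""
    collected = []
    seen_ids = set()
    last_id = None  # None means "first request"
    while True:
        batch = fetch_page(last_id)
        # Drop anything already collected so a repeating page cannot loop forever.
        new_items = [item for item in batch if item["id"] not in seen_ids]
        if not new_items:
            break
        seen_ids.update(item["id"] for item in new_items)
        collected.extend(new_items)
        last_id = new_items[-1]["id"]
    return collected


if __name__ == "__main__":
    print(len(collect_all(fake_fetch)))
```

Passing the fetcher in as a function keeps the loop testable without network access; in a real run it would wrap an HTTP request made with Requests.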

Installation

  1. Ensure Python 3 is installed on your system.
  2. Install the Beautiful Soup 4 and Requests libraries:
    pip install beautifulsoup4 requests
    
  3. Download pravda-extract.py from this repository.

Usage

Run the script with the domain as an argument:

python3 pravda-extract.py [DOMAIN]

For example:

python3 pravda-extract.py pravda-fi.com

This command fetches news from 'pravda-fi.com' and saves it to a JSON file named pravda-fi.com_DD-MM-YY.json, where DD-MM-YY is the current date.
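
The naming convention can be reproduced with the standard library. The sketch below is an assumption about how such a name might be built and written, not the script's actual code; output_filename and save_items are hypothetical helpers.

```python
import json
from datetime import date


def output_filename(domain, today=None):
    """Build '<domain>_DD-MM-YY.json'; today defaults to the current date."""
    today = today or date.today()
    return f"{domain}_{today.strftime('%d-%m-%y')}.json"


def save_items(items, domain, today=None):
    """Write the items to the dated JSON file and return the file name."""
    path = output_filename(domain, today)
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-Latin titles readable in the file.
        json.dump(items, handle, ensure_ascii=False, indent=2)
    return path


if __name__ == "__main__":
    print(output_filename("pravda-fi.com", date(2024, 3, 5)))
```

For 5 March 2024 this yields pravda-fi.com_05-03-24.json.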

Dependencies

  • Python 3
  • Beautiful Soup 4
  • Requests

Contributing

Contributions, issues, and feature requests are welcome. Feel free to check the issues page if you want to contribute.

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

Extract all article links from any Pravda domain