Nachrichtenleicht Crawler is a web scraper that extracts German news articles and their audio recordings from nachrichtenleicht.de. It gives German learners, particularly at the A2-B1 level, material for intensive listening practice.
- News Text Scraping: Fetches the latest news articles from Nachrichtenleicht. The articles are written in simple, accessible German, which suits learners at an early stage.
- Audio Download: Collects the corresponding audio files for each article, enabling learners to practice listening.
- Text Formatting: Formats the scraped text with one sentence per line, making it easier to process for subtitles or other learning tools.
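
The one-sentence-per-line formatting could be approximated as shown below. This is a minimal sketch and not necessarily how the project does it: the function name `splitSentences` is hypothetical, and the rule it uses (break after `.`, `!` or `?` when the next word starts with a capital letter or an opening quote) is an assumption.

```javascript
// Hypothetical helper: split a block of German text into one sentence per line.
// Assumption: a sentence ends with ".", "!" or "?" and the next sentence
// starts with an uppercase letter (including umlauts) or an opening quote.
function splitSentences(text) {
  return text
    .replace(/\s+/g, ' ') // collapse line breaks and repeated spaces
    .trim()
    .split(/(?<=[.!?])\s+(?=[A-ZÄÖÜ"„])/) // break between sentences
    .filter((sentence) => sentence.length > 0);
}

const sample =
  'Das Wetter ist schön. Viele Menschen gehen in den Park!\nKommt der Regen morgen?';

// Print one sentence per line.
console.log(splitSentences(sample).join('\n'));
```

A rule this simple will split in the wrong place after ordinal dates such as "am 3. Mai" and after some abbreviations, so real articles may need extra handling.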
Run the following command in the root directory to install the required dependencies:
npm i
Use the following command to scrape the latest news text and the URLs of the audio files from the Nachrichtenleicht website:
node index.js
To download the audio files from the scraped URLs, run the following command:
node audioDownloader.js
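
For orientation, downloading a single audio file from a scraped URL might look roughly like the sketch below. It is not the code in `audioDownloader.js`. The function names `fileNameFromUrl` and `downloadAudio`, the `audio` output directory, and the example URL are all assumptions, and it relies on the global `fetch` available in Node.js 18 and later.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Hypothetical helper: derive a local file name from the last path segment
// of the audio URL, ignoring any query string.
function fileNameFromUrl(audioUrl) {
  const { pathname } = new URL(audioUrl);
  // Fall back to a generic name if the URL has no file segment (assumption).
  return path.posix.basename(pathname) || 'audio.mp3';
}

// Hypothetical helper: download one audio file into outDir and
// return the path it was written to.
async function downloadAudio(audioUrl, outDir = 'audio') {
  const response = await fetch(audioUrl);
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} for ${audioUrl}`);
  }
  const data = Buffer.from(await response.arrayBuffer());
  await fs.mkdir(outDir, { recursive: true }); // create the folder if missing
  const target = path.join(outDir, fileNameFromUrl(audioUrl));
  await fs.writeFile(target, data);
  return target;
}

// Example call (placeholder URL, not a real Nachrichtenleicht address):
// downloadAudio('https://example.com/audio/nachricht-01.mp3').then(console.log);
```

Buffering the whole response in memory keeps the sketch short, which should be acceptable for short news recordings. Streaming to disk would be the safer choice for large files.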
Here is an example of the output from scraping the news articles and audio: