A Node.js bot that scrapes text from news articles and saves them as individual text files.
- Scrapes text content from news article URLs
- Saves each article as a separate .txt file
- Handles various news site layouts
- Preserves article title and URL in the saved file
- Sanitizes filenames for compatibility
- Anti-ban measures:
  - Proxy support (HTTP, HTTPS, SOCKS5, SmartProxy)
  - Rotating proxy list support
  - User-agent rotation
  - Request delays
  - Robots.txt compliance
- Make sure you have Node.js installed (v12 or higher recommended)
- Clone or download this repository
- Install dependencies:

  ```bash
  cd news-scraper
  npm install
  ```

- Configure settings in the `.env` file (see Configuration section below)
There are multiple ways to use this scraper:
```bash
# Scrape one or more article URLs passed as arguments
node src/index.js https://example.com/article1 https://example.com/article2

# Scrape URLs listed in a file
node src/index.js urls.txt

# Interactive mode: prompts for URLs
node src/index.js
```
When run without arguments, the program will prompt you to enter URLs one by one. Press Enter on an empty line to start scraping.
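For the file-based mode, a plain text file with one URL per line is the assumed format (check `src/index.js` if your copy expects something different). For example, a `urls.txt` such as:

```
https://example.com/news/article-one
https://example.com/news/article-two
https://example.com/news/article-three
```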
The scraper can be configured using the `.env` file. Copy `.env.example` to `.env` and modify the settings.
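For example, on a Unix-like shell:

```bash
cp .env.example .env
```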
To use a proxy, set `PROXY_TYPE` to one of `http`, `https`, `socks5`, `smartproxy`, or `proxy_list`, and configure the corresponding proxy settings:
```env
# Proxy type
PROXY_TYPE=http

# HTTP proxy (format: http://username:password@host:port)
HTTP_PROXY=http://user:pass@proxy.example.com:8080
```
```env
# Proxy type
PROXY_TYPE=smartproxy

# SmartProxy settings
SMARTPROXY_USER=your_username
SMARTPROXY_PASS=your_password
SMARTPROXY_HOST=gate.smartproxy.com
SMARTPROXY_PORT=7000
```
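For reference, here is a minimal sketch of how these SmartProxy values could be turned into a proxied request. It assumes `dotenv`, `axios`, and `https-proxy-agent` (v7) are available; the actual wiring in `src/index.js` may differ.

```js
// Sketch only: build a proxy agent from the SmartProxy settings above.
require('dotenv').config();
const axios = require('axios');
const { HttpsProxyAgent } = require('https-proxy-agent');

const { SMARTPROXY_USER, SMARTPROXY_PASS, SMARTPROXY_HOST, SMARTPROXY_PORT } = process.env;
const proxyUrl = `http://${SMARTPROXY_USER}:${SMARTPROXY_PASS}@${SMARTPROXY_HOST}:${SMARTPROXY_PORT}`;
const agent = new HttpsProxyAgent(proxyUrl);

async function fetchThroughProxy(url) {
  // Route the request through the proxy and disable axios's built-in proxy handling.
  const response = await axios.get(url, { httpsAgent: agent, proxy: false });
  return response.data;
}
```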
```env
# Proxy type
PROXY_TYPE=proxy_list

# Path to a file containing a list of proxies (one per line)
PROXY_LIST_FILE=proxies.txt
```
The `proxies.txt` file should contain one proxy per line in the format:

```
http://user1:pass1@proxy1.example.com:8080
http://user2:pass2@proxy2.example.com:8080
socks5://user3:pass3@proxy3.example.com:1080
```
The scraper will rotate through these proxies for each request, helping to distribute the load and avoid IP bans.
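A minimal sketch of round-robin rotation over `proxies.txt` (the rotation logic actually used in this project may differ):

```js
const fs = require('fs');

// Read proxies.txt: one proxy URL per line, blank lines ignored.
const proxies = fs.readFileSync('proxies.txt', 'utf8')
  .split('\n')
  .map((line) => line.trim())
  .filter(Boolean);

let nextIndex = 0;

// Return the next proxy in round-robin order, wrapping around at the end.
function getNextProxy() {
  const proxy = proxies[nextIndex];
  nextIndex = (nextIndex + 1) % proxies.length;
  return proxy;
}
```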
Additional anti-ban settings can also be set in the `.env` file:

```env
# Delay between requests in milliseconds (to avoid rate limiting)
REQUEST_DELAY=2000

# Whether to rotate user agents for each request
USE_RANDOM_USER_AGENT=true

# Whether to respect robots.txt
RESPECT_ROBOTS_TXT=true
```
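Roughly, these settings translate into logic like the following sketch; the user-agent pool shown here is illustrative, not the scraper's actual list:

```js
// Read the anti-ban settings from the environment.
const REQUEST_DELAY = parseInt(process.env.REQUEST_DELAY || '2000', 10);
const USE_RANDOM_USER_AGENT = process.env.USE_RANDOM_USER_AGENT === 'true';

// Example user agents only; any realistic browser strings work here.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
];

// Pause between requests to avoid rate limiting.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Pick a user agent for the next request.
function pickUserAgent() {
  if (!USE_RANDOM_USER_AGENT) return USER_AGENTS[0];
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Usage between requests: await delay(REQUEST_DELAY);
```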
If you're scraping a large number of articles, especially from the same domain, using a proxy is recommended to avoid IP bans. Options include:
- Proxy List: The most flexible option. Create a `proxies.txt` file with your proxy list and set `PROXY_TYPE=proxy_list` in your `.env` file. The scraper will rotate through these proxies automatically.
- SmartProxy: This scraper has built-in support for SmartProxy, which provides rotating residential IPs. Simply set your credentials in the `.env` file and set `PROXY_TYPE=smartproxy`.
- Other Rotating Residential Proxies: Services like Bright Data, Oxylabs, or SmartProxy provide residential IPs that are less likely to be detected as proxies.
- Datacenter Proxies: More affordable but may be detected more easily. Providers include ProxyMesh, IPRoyal, or Webshare.
- Free Proxies: Not recommended for serious scraping, as they are often unreliable, slow, and may already be banned.
Scraped articles are saved to the `articles` directory in the project root. Each file is named based on the article title.
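The filename sanitization mentioned above might look roughly like this sketch; the exact rules the scraper applies may differ:

```js
const fs = require('fs');
const path = require('path');

// Strip characters that are invalid on common filesystems and tidy up whitespace.
function sanitizeFilename(title) {
  return title
    .replace(/[<>:"/\\|?*]/g, '') // remove characters Windows/macOS/Linux dislike
    .replace(/\s+/g, '_')         // collapse whitespace to underscores
    .slice(0, 100);               // keep filenames reasonably short
}

// Save an article with its title and source URL at the top of the file.
function saveArticle(title, url, text, outDir = 'articles') {
  fs.mkdirSync(outDir, { recursive: true });
  const filePath = path.join(outDir, `${sanitizeFilename(title)}.txt`);
  fs.writeFileSync(filePath, `${title}\n${url}\n\n${text}`, 'utf8');
}
```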
If the scraper fails to extract content from a particular site, it may be due to:
- The site using JavaScript to load content (this scraper only handles static HTML)
- The site having an unusual structure
- The site blocking automated requests
In such cases, you might need to modify the content selectors in the code to match the specific site structure or use a proxy if you're being blocked.
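If the scraper uses a static-HTML parser such as cheerio (a common choice in Node scrapers), adding a site-specific selector might look roughly like this; the selector names below are examples, not the project's actual ones:

```js
const cheerio = require('cheerio');

// Try a list of candidate selectors and return the first one that yields substantial text.
function extractArticleText(html) {
  const $ = cheerio.load(html);
  const candidates = ['article', '.article-body', '#main-content', '.post-content'];
  for (const selector of candidates) {
    const text = $(selector).text().trim();
    if (text.length > 200) return text; // assume real article bodies are reasonably long
  }
  return null; // nothing matched: the site may need JS rendering or another selector
}
```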