Victoria Cheng edited this page Jan 1, 2025 · 4 revisions

Welcome to the articles-extractor wiki!

The app is currently capable of extracting articles from:

  • freeCodeCamp
  • Substack
  • GitHub

I may expand support to include additional sites.

Getting started

Installation

git clone git@github.com:victoriacheng15/articles-extractor.git

cd articles-extractor

Set up Google APIs

Follow the Python quickstart guide from Google to set up the Google Sheets API.

credentials.json

Once the Google Sheets API is set up, download the credentials from Google, rename the JSON file to credentials.json, and move it to the root directory of the project. If you use a different file name, update the file name in the get_creds function in utils/sheet.py.
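
Before running the extractor, it can help to confirm that the credentials file sits where the app expects it. The sketch below is a hypothetical standalone check, not the project's get_creds function: the check_credentials name and its behavior are assumptions made for illustration, and it relies only on the Python standard library.

```python
import json
from pathlib import Path


def check_credentials(root: str = ".", filename: str = "credentials.json") -> Path:
    """Return the path to the credentials file, or raise if it is missing or not JSON.

    Hypothetical helper for illustration; not part of articles-extractor.
    """
    path = Path(root) / filename
    if not path.is_file():
        raise FileNotFoundError(f"{filename} not found in {Path(root).resolve()}")
    # Parse the file to catch a truncated or wrongly saved download early.
    with path.open(encoding="utf-8") as fh:
        json.load(fh)
    return path
```

Calling check_credentials() from the project root raises FileNotFoundError if the file was not moved there, and a JSON decoding error if the download is corrupted.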

Add your list of providers to Google Sheets

Create a new worksheet and name it providers.

Example of the table:

name | element | url
freecodecamp | article | https://www.freecodecamp.org/news/
substack | pencraft pc-display-flex pc-flexDirection-column pc-gap-4 | the archive blog link
substack | pencraft pc-display-flex pc-flexDirection-column pc-gap-4 | the archive blog link
substack | pencraft pc-display-flex pc-flexDirection-column pc-gap-4 | the archive blog link
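
To illustrate how a sheet with these three columns can be consumed, the sketch below turns raw rows (a header row followed by data rows, each a list of strings) into dictionaries keyed by column name. The rows_to_providers function is a hypothetical helper and the sample rows are placeholders; how the app itself reads the providers sheet is not shown on this page.

```python
def rows_to_providers(rows: list[list[str]]) -> list[dict[str, str]]:
    """Map each data row onto the header row (name, element, url)."""
    if not rows:
        return []
    header = [cell.strip().lower() for cell in rows[0]]
    providers = []
    for row in rows[1:]:
        # Pad short rows so a missing trailing cell becomes an empty string.
        padded = row + [""] * (len(header) - len(row))
        providers.append(dict(zip(header, (cell.strip() for cell in padded))))
    return providers


# Placeholder rows shaped like the example table.
rows = [
    ["name", "element", "url"],
    ["freecodecamp", "article", "https://www.freecodecamp.org/news/"],
    # The url cell is left out here to show how short rows are padded.
    ["substack", "pencraft pc-display-flex pc-flexDirection-column pc-gap-4"],
]
providers = rows_to_providers(rows)
```

Keying each row by its header means the columns can be reordered in the sheet without changing the code that reads them.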

Set up data/providers.py (no longer supported in the new version)

Create a file named providers.py under the data folder. If you save the file under a different name, update the path in main.py at line 5.

Single URL:

provider_name = {
    "class": "the element that contains article info",
    "url": "url",
}

For example:

freecodecamp = {"class": "post-card", "url": "https://www.freecodecamp.org/news/"}

Multiple URLs:

provider_name = {
    "class": "the element that contains article info",
    "URLs": [
      "url1",
      "url2",
      # ...more URLs
    ],
}
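
Because a provider can carry either one URL or a list of URLs, code that consumes these dictionaries has to handle both shapes. The sketch below shows one way to normalize them. The provider_urls function is a hypothetical helper, and the key names ("url" for a single address, "URLs" for a list) are taken from the examples on this page rather than from the project's source.

```python
def provider_urls(provider: dict) -> list[str]:
    """Return the provider's URLs as a list, whichever shape the dict uses."""
    if "URLs" in provider:
        # Multiple-URL shape: copy the list so callers cannot mutate the original.
        return list(provider["URLs"])
    # Single-URL shape: wrap the one address in a list.
    return [provider["url"]]


freecodecamp = {"class": "post-card", "url": "https://www.freecodecamp.org/news/"}
multi = {"class": "post-card", "URLs": ["url1", "url2"]}

single_urls = provider_urls(freecodecamp)
multi_urls = provider_urls(multi)
```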

Run the extractor in a Docker container

You can easily run the extractor using the provided Makefile. Execute the following command:

make start

Deployment

You can deploy the app on a Raspberry Pi or on your main machine.

  • Create the bash script: touch articles_extractor.sh
  • Make it executable: chmod +x articles_extractor.sh
  • Go to the project folder and run pwd to get the folder path
  • Copy the code below into articles_extractor.sh, replacing your_path with that path
#!/bin/bash

cd your_path/articles-extractor

docker compose up

# Save the container log to a file (the author keeps it in the Documents folder; adjust the path as needed)
docker logs extractor > log_articles.txt

docker compose down

Logging is optional. If you do not want the log file, use:

#!/bin/bash

cd your_path/articles-extractor

docker compose up

docker compose down

Cron Schedule

Use crontab guru to work out the schedule you want.

For example, to run the app at 9 am every day, the cron expression is 0 9 * * *

  • Type crontab -e
  • If it prompts you to choose an editor, pick nano or the editor of your choice
  • Scroll down all the way to the bottom
  • Type 0 9 * * * your_path/articles_extractor.sh
  • Save and exit (in nano): Ctrl + O -> Enter -> Ctrl + X
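
The schedule string is easier to adjust once its five fields are labeled. The sketch below uses hypothetical helper names (split_cron and crontab_line) to split an expression such as 0 9 * * * into minute, hour, day of month, month, and day of week, and to assemble the line typed into crontab -e. It only checks the field count and does not validate value ranges.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression: str) -> dict[str, str]:
    """Label the five fields of a standard cron expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def crontab_line(expression: str, script_path: str) -> str:
    """Build the line to add at the bottom of the crontab file."""
    split_cron(expression)  # validate the field count first
    return f"{expression} {script_path}"


fields = split_cron("0 9 * * *")
line = crontab_line("0 9 * * *", "your_path/articles_extractor.sh")
```

Here fields maps minute to "0" and hour to "9", with the remaining three fields set to "*", which is why the job runs once a day at 9 am.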