Victoria Cheng edited this page Feb 21, 2025 · 11 revisions

Welcome to the articles-extractor wiki!

The app is currently capable of extracting articles from:

  • freeCodeCamp
  • Substack
  • The GitHub Blog

I may expand support to include additional sites.

Getting started

Installation

git clone git@github.com:victoriacheng15/articles-extractor.git

cd articles-extractor

Prerequisites

1. Set Up Credentials & Environment

  • Google Sheets API:

    • Follow Google’s API quickstart guide to download credentials.json.
    • Place credentials.json in the project’s root directory.
  • Configure .env:

    cp .env.example .env  # Copy template

    Edit .env with:

    SHEET_ID="your_google_sheet_id_here"  # Found in your Sheet’s URL
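A quick way to sanity-check the .env file before running the app is sketched below. It is a minimal, standard-library-only illustration: the helper names parse_env and get_sheet_id are hypothetical and are not part of articles-extractor, and the app itself may load .env differently.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value[:1] in ('"', "'"):
            # Quoted value: keep only what sits between the quotes.
            end = value.find(value[0], 1)
            if end != -1:
                value = value[1:end]
        else:
            # Unquoted value: drop a trailing inline comment.
            value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def get_sheet_id(env: dict[str, str]) -> str:
    """Return SHEET_ID, rejecting an empty value or the template placeholder."""
    sheet_id = env.get("SHEET_ID", "")
    if not sheet_id or sheet_id == "your_google_sheet_id_here":
        raise ValueError("SHEET_ID is missing or still set to the placeholder")
    return sheet_id
```

For example, get_sheet_id(parse_env(open(".env").read())) raises a ValueError if the template value was never replaced.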

2. Add Providers to Google Sheet

  • Create a worksheet:

    • Name the sheet providers (exact spelling, lowercase).
  • Build your provider list:
    Create a table with these columns:

    name         | element (CSS selector/class)                              | url
    freecodecamp | article                                                   | https://www.freecodecamp.org/news/
    github       | article                                                   | https://github.blog/category/engineering/
    substack     | pencraft pc-display-flex pc-flexDirection-column pc-gap-4 | https://[your-substack].substack.com/archive
    substack     | pencraft pc-display-flex pc-flexDirection-column pc-gap-4 | https://[your-substack].substack.com/archive
    substack     | pencraft pc-display-flex pc-flexDirection-column pc-gap-4 | https://[your-substack].substack.com/archive

    Notes:

    • Replace [your-substack] with your actual Substack domain.
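The provider table can be checked before a run with a small validation pass, sketched below. This is only an illustration: the Provider class and parse_providers function are hypothetical names, and the assumption that rows arrive as lists of strings with a header row first may not match how the app reads the sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    element: str  # tag name or class list, exactly as written in the sheet
    url: str


def parse_providers(rows: list[list[str]]) -> tuple[list[Provider], list[str]]:
    """Split sheet rows into valid providers and human-readable problems."""
    providers: list[Provider] = []
    problems: list[str] = []
    # Skip the header row (name, element, url); data rows are numbered from 2.
    for number, row in enumerate(rows[1:], start=2):
        cells = [cell.strip() for cell in row]
        if len(cells) != 3 or not all(cells):
            problems.append(f"row {number}: expected name, element, and url")
            continue
        name, element, url = cells
        if "[your-substack]" in url:
            problems.append(f"row {number}: replace [your-substack] in the url")
            continue
        if not url.startswith(("http://", "https://")):
            problems.append(f"row {number}: url must start with http:// or https://")
            continue
        providers.append(Provider(name, element, url))
    return providers, problems
```

Rows that still contain the [your-substack] placeholder are reported as problems rather than silently kept.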

Deployment

Choose one of these methods to run the app:

Note: If you are not using GitHub Actions, disable the workflow: go to Actions → Daily Extraction Schedule, open the horizontal three-dot menu (⋯), and select Disable workflow.


1. Manual Local Run

# Create and activate virtual environment (recommended)
python3 -m venv venv

source venv/bin/activate  # Linux/Mac

# venv\Scripts\activate   # Windows

# Install dependencies and run
pip install -r requirements.txt

python3 main.py

# Alternative using Makefile
make init

make run

2. Docker + Cron Scheduling

  1. Run with Docker:

    make up  # Builds and starts the container
  2. Schedule with cron:

    crontab -e

    Add a line for your schedule (for example, daily at 9 AM):

    0 9 * * * cd ~/path_to_project/articles-extractor && make up  

    Use crontab.guru to customize timing.

Replace ~/path_to_project with your project’s absolute path (use pwd in the terminal to confirm).
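To make the layout of the crontab line explicit, the sketch below splits it into its five schedule fields and the command. The function name split_crontab_line is hypothetical, and the code only labels the fields; it does not validate their values or schedule anything.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_crontab_line(line: str) -> dict[str, str]:
    """Label the five schedule fields of a crontab line and keep the command."""
    # Split on whitespace at most five times so the command keeps its spaces.
    parts = line.strip().split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    entry = dict(zip(CRON_FIELDS, parts[:5]))
    entry["command"] = parts[5]
    return entry


entry = split_crontab_line(
    "0 9 * * * cd ~/path_to_project/articles-extractor && make up"
)
print(entry["minute"], entry["hour"], entry["command"])
```

Here minute is 0 and hour is 9, and the three asterisks mean every day of the month, every month, and every day of the week.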


3. GitHub Actions

  1. Add secrets in repo settings (Settings > Secrets and variables > Actions):

    • CREDENTIALS: Paste the entire content of your credentials.json
    • SHEET_ID: Your Google Sheet ID
  2. Example workflow trigger (already configured if you use the repo's .github/workflows/):

    on:
      schedule:
        - cron: '0 9 * * *'  # Daily at 9 AM UTC
      workflow_dispatch:     # Manual trigger

    The action will automatically:

    • Install dependencies
    • Run main.py
    • Clean up resources
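Because the CREDENTIALS secret holds the full JSON text of credentials.json, a script can parse it from the environment instead of reading a file. The sketch below assumes the workflow exposes the secret as an environment variable named CREDENTIALS; that wiring, and the helper name load_credentials, are assumptions, not a description of how this repo's workflow does it.

```python
import json
from collections.abc import Mapping


def load_credentials(env: Mapping[str, str]) -> dict:
    """Parse the JSON stored in the CREDENTIALS variable into a dict."""
    raw = env.get("CREDENTIALS", "")
    if not raw.strip():
        raise ValueError("CREDENTIALS is not set")
    try:
        info = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Typical cause: only part of credentials.json was pasted as the secret.
        raise ValueError("CREDENTIALS does not contain valid JSON") from exc
    if not isinstance(info, dict):
        raise ValueError("CREDENTIALS must be a JSON object")
    return info
```

In a real run this would be called as load_credentials(os.environ), which keeps the secret out of the repository and out of any committed file.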

Security Note: Never commit credentials.json or .env files!