Releases: rlnorthcutt/sitemapExport

1.0.0 Release!

16 Nov 00:44

After some testing and feedback, I think we are ready for a stable release. Only minor changes here:

  • Minor changes to the README
  • Added a devcontainer setup - yay!
  • Added a VSCode workspace file for those who use it (plays nicely with the devcontainer)
  • Fixed an error where two flags shared the same short flag

Changelog: 0.0.9...1.0.0

0.0.9

24 Oct 18:49

Release Notes - sitemapExport v0.0.9 (First Official Release)

Release Date: October 2024

I'm excited to announce the first official release of sitemapExport! This release is the initial stable version of my Go-based CLI tool, which crawls sitemaps or RSS feeds, extracts content from web pages, and compiles the data into a variety of formats. Below is a summary of the features, improvements, and functionality included in this release.

🚀 New Features

  • Sitemap and RSS Feed Crawling

    • sitemapExport supports both sitemaps and RSS feeds as input sources.
    • Automatically detects whether a provided URL is a sitemap or an RSS feed.
  • Content Extraction via CSS Selectors

    • Extract content from pages using a customizable CSS selector (default: body).
  • Supported Output Formats

    • Text (txt): Outputs plain text files.
    • JSON (json): Outputs formatted JSON files.
    • JSON Lines (jsonl): Outputs one JSON object per line (ideal for large datasets).
    • Markdown (md): Outputs content in Markdown format.
    • PDF (pdf): Outputs content as a PDF document.
  • File Output Options

    • Customizable output filename and file type via command-line flags or interactive prompts.
  • Interactive and Non-Interactive Modes

    • Use the CLI interactively, where missing input is prompted, or pass all options via flags for automation.
  • Flexible Content Transformation

    • Convert page content into various formats:
      • HTML: Sanitized and clean HTML content.
      • Markdown (MD): Convert HTML to Markdown using the html-to-markdown package.
      • Text (TXT): Convert HTML to plain text using html2text processing.
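
To illustrate the JSON Lines format listed above, here is a minimal sketch of writing one JSON object per line with Go's standard library. The `Page` type, its field names, and the `writeJSONL` and `renderJSONL` helpers are assumptions made for this example; they are not taken from the sitemapExport source and its real record layout may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Page is a hypothetical record for one crawled page.
// The field names are assumptions, not the tool's actual schema.
type Page struct {
	URL     string `json:"url"`
	Content string `json:"content"`
}

// writeJSONL writes each page as a single JSON object on its own line.
// json.Encoder.Encode appends a newline after every value, which is
// exactly the JSON Lines layout.
func writeJSONL(w io.Writer, pages []Page) error {
	enc := json.NewEncoder(w)
	for _, p := range pages {
		if err := enc.Encode(p); err != nil {
			return err
		}
	}
	return nil
}

// renderJSONL is a small convenience wrapper that returns the output
// as a string instead of writing to a file.
func renderJSONL(pages []Page) (string, error) {
	var sb strings.Builder
	if err := writeJSONL(&sb, pages); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderJSONL([]Page{
		{URL: "https://example.com/a", Content: "First page"},
		{URL: "https://example.com/b", Content: "Second page"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Because each line is a complete JSON value, a consumer can process the file one line at a time without loading the whole export into memory, which is why the format suits large datasets.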

⚙️ Enhancements

  • Input Validation:

    • Validates user input for file type and content format to prevent unsupported formats.
  • User-Friendly Prompts:

    • Provides default values for CSS selectors, output filename, file type, and format when not supplied by the user.
    • Prompts the user for confirmation before proceeding with the export.
  • Error Handling:

    • Improved error handling with descriptive messages at each step (e.g., crawling, formatting, writing to file).
    • Detects invalid URLs, unsupported formats, and feed type issues early in the process.
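
The input validation described above could look roughly like the following sketch. The accepted file types (txt, json, jsonl, md, pdf) and content formats (HTML, MD, TXT) come from these release notes; the function name `validateOptions`, the lowercase value spellings, the case-insensitive matching, and the error wording are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"strings"
)

// Supported values as listed in the release notes. The lowercase
// spellings used as keys here are an assumption.
var supportedFileTypes = map[string]bool{
	"txt": true, "json": true, "jsonl": true, "md": true, "pdf": true,
}

var supportedFormats = map[string]bool{
	"html": true, "md": true, "txt": true,
}

// validateOptions is a hypothetical helper that rejects unsupported
// values early, before any crawling starts. Matching is made
// case-insensitive here, which is a design choice of this sketch.
func validateOptions(fileType, format string) error {
	ft := strings.ToLower(strings.TrimSpace(fileType))
	if !supportedFileTypes[ft] {
		return fmt.Errorf("unsupported file type %q: want txt, json, jsonl, md, or pdf", fileType)
	}
	cf := strings.ToLower(strings.TrimSpace(format))
	if !supportedFormats[cf] {
		return fmt.Errorf("unsupported content format %q: want html, md, or txt", format)
	}
	return nil
}

func main() {
	fmt.Println(validateOptions("json", "md"))
	fmt.Println(validateOptions("docx", "md"))
	fmt.Println(validateOptions("pdf", "rtf"))
}
```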

🔧 Technical Details

  • Modular Project Structure:

    • crawler/: Handles sitemap and RSS crawling and content extraction.
    • formatter/: Formats the extracted content into different file types.
    • writer/: Manages file output and writing content to disk.
    • feed/: Detects feed types (sitemap or RSS).
    • main.go: Entry point for the CLI, leveraging cobra for flag parsing and command execution.
  • Command-Line Interface:

    • Built with cobra, providing flexibility for future expansion and additional commands.
  • Dependencies:

    • goquery for parsing HTML.
    • sanitize for cleaning and sanitizing HTML content.
    • gofpdf for generating PDF output.
    • html-to-markdown for Markdown conversion.
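
As a rough illustration of what the feed/ package's type detection might involve, the sketch below inspects the root XML element of a document: sitemaps use `urlset` or `sitemapindex`, while RSS feeds use `rss`. This is one plausible approach using only the standard library, not necessarily how sitemapExport does it; the `detectFeedType` name and its string return values are assumptions.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// detectFeedType is a hypothetical helper that returns "sitemap",
// "rss", or "unknown" based on the first XML start element found.
func detectFeedType(data []byte) string {
	dec := xml.NewDecoder(bytes.NewReader(data))
	for {
		tok, err := dec.Token()
		if err != nil {
			// Covers both malformed input and reaching the end of the
			// document without seeing any element.
			return "unknown"
		}
		start, ok := tok.(xml.StartElement)
		if !ok {
			// Skip the XML declaration, comments, and whitespace.
			continue
		}
		switch start.Name.Local {
		case "urlset", "sitemapindex":
			return "sitemap"
		case "rss":
			return "rss"
		default:
			return "unknown"
		}
	}
}

func main() {
	sitemap := `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>`
	rss := `<rss version="2.0"><channel></channel></rss>`

	fmt.Println(detectFeedType([]byte(sitemap)))
	fmt.Println(detectFeedType([]byte(rss)))
	fmt.Println(detectFeedType([]byte("<html></html>")))
}
```

Only the root element is examined, so the check stays cheap even for large feeds.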

Full Changelog: https://github.com/rlnorthcutt/sitemapExport/commits/0.0.9