Releases: rlnorthcutt/sitemapExport
1.0.0 Release!
After some testing and feedback, I think we're ready for a stable release. Minor changes here:
- minor changes to the README
- adding devcontainer setup - yay!
- adding a VSCode workspace file for those who use it (plays nicely with the devcontainer)
- fixing an error where two short flags were identical
Changelog: 0.0.9...1.0.0
0.0.9
Release Notes - sitemapExport
v0.0.9 (First Official Release)
Release Date: October 2024
I'm excited to announce the first official release of `sitemapExport`! This version marks the initial stable version of my Go-based CLI tool designed to crawl sitemaps or RSS feeds, extract content from web pages, and compile the data into a variety of formats. Below is a summary of the features, improvements, and functionality included in this release.
🚀 New Features
- **Sitemap and RSS Feed Crawling**
  - `sitemapExport` supports both sitemaps and RSS feeds as input sources.
  - Automatically detects whether a provided URL is a sitemap or an RSS feed.
- **Content Extraction via CSS Selectors**
  - Extract content from pages using a customizable CSS selector (default: `body`).
- **Supported Output Formats**
  - Text (`txt`): Outputs plain text files.
  - JSON (`json`): Outputs formatted JSON files.
  - JSON Lines (`jsonl`): Outputs one JSON object per line (ideal for large datasets).
  - Markdown (`md`): Outputs content in Markdown format.
  - PDF (`pdf`): Outputs content as a PDF document.
- **File Output Options**
  - Customizable output filename and file type via command-line flags or interactive prompts.
- **Interactive and Non-Interactive Modes**
  - Use the CLI interactively, where missing input is prompted, or pass all options via flags for automation.
- **Flexible Content Transformation**
  - Convert page content into various formats:
    - HTML: Sanitized and clean HTML content.
    - Markdown (MD): Convert HTML to Markdown using the `html-to-markdown` package.
    - Text (TXT): Convert HTML to plain text using `html2text` processing.
⚙️ Enhancements
- **Input Validation:**
  - Validates user input for file type and content format to prevent unsupported formats.
- **User-Friendly Prompts:**
  - Provides default values for CSS selectors, output filename, file type, and format when not supplied by the user.
  - Prompts the user for confirmation before proceeding with the export.
- **Error Handling:**
  - Improved error handling with descriptive messages at each step (e.g., crawling, formatting, writing to file).
  - Detects invalid URLs, unsupported formats, and feed type issues early in the process.
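Early validation of this kind amounts to checking requested values against a whitelist before any crawling starts. A minimal sketch of the approach, assuming a hypothetical `validateFileType` helper (the real tool's messages and structure may differ):

```go
package main

import "fmt"

// supportedFileTypes mirrors the output formats listed above;
// the actual tool may keep this set elsewhere.
var supportedFileTypes = map[string]bool{
	"txt": true, "json": true, "jsonl": true, "md": true, "pdf": true,
}

// validateFileType fails early with a descriptive message instead of
// discovering an unsupported format after the crawl has finished.
func validateFileType(ft string) error {
	if !supportedFileTypes[ft] {
		return fmt.Errorf("unsupported file type %q (expected txt, json, jsonl, md, or pdf)", ft)
	}
	return nil
}

func main() {
	for _, ft := range []string{"jsonl", "docx"} {
		if err := validateFileType(ft); err != nil {
			fmt.Println(err)
		} else {
			fmt.Println(ft, "ok")
		}
	}
}
```

Failing before the crawl begins is what makes the descriptive messages useful: the user gets feedback in milliseconds rather than after fetching every page.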
🔧 Technical Details
- **Modular Project Structure:**
  - `crawler/`: Handles sitemap and RSS crawling and content extraction.
  - `formatter/`: Formats the extracted content into different file types.
  - `writer/`: Manages file output and writing content to disk.
  - `feed/`: Detects feed types (sitemap or RSS).
  - `main.go`: Entry point for the CLI, leveraging `cobra` for flag parsing and command execution.
- **Command-Line Interface:**
  - Built with `cobra`, providing flexibility for future expansion and additional commands.
- **Dependencies:**
  - `goquery` for parsing HTML.
  - `sanitize` for cleaning and sanitizing HTML content.
  - `gofpdf` for generating PDF output.
  - `html-to-markdown` for Markdown conversion.
Full Changelog: https://github.com/rlnorthcutt/sitemapExport/commits/0.0.9