Releases: rlnorthcutt/sitemapExport
1.0.0 Release!
After some testing and feedback, I think we're ready for a stable release. Minor changes here:
- minor changes to the README
- adding devcontainer setup - yay!
- adding a VSCode workspace file for those who use it (plays nicely with the devcontainer)
- fixing an error where two short flags were identical
Changelog: 0.0.9...1.0.0
0.0.9
Release Notes - sitemapExport
v0.0.9 (First Official Release)
Release Date: October 2024
I'm excited to announce the first official release of `sitemapExport`! This version marks the initial stable version of my Go-based CLI tool designed to crawl sitemaps or RSS feeds, extract content from web pages, and compile the data into a variety of formats. Below is a summary of the features, improvements, and functionality included in this release.
🚀 New Features
- **Sitemap and RSS Feed Crawling**
  - `sitemapExport` supports both sitemaps and RSS feeds as input sources.
  - Automatically detects whether a provided URL is a sitemap or an RSS feed.
- **Content Extraction via CSS Selectors**
  - Extract content from pages using a customizable CSS selector (default: `body`).
- **Supported Output Formats**
  - Text (`txt`): Outputs plain text files.
  - JSON (`json`): Outputs formatted JSON files.
  - JSON Lines (`jsonl`): Outputs one JSON object per line (ideal for large datasets).
  - Markdown (`md`): Outputs content in Markdown format.
  - PDF (`pdf`): Outputs content as a PDF document.
- **File Output Options**
  - Customizable output filename and file type via command-line flags or interactive prompts.
- **Interactive and Non-Interactive Modes**
  - Use the CLI interactively, where missing input is prompted, or pass all options via flags for automation.
- **Flexible Content Transformation**
  - Convert page content into various formats:
    - HTML: Sanitized and clean HTML content.
    - Markdown (MD): Convert HTML to Markdown using the `html-to-markdown` package.
    - Text (TXT): Convert HTML to plain text using `html2text` processing.
⚙️ Enhancements
- **Input Validation:**
  - Validates user input for file type and content format to prevent unsupported formats.
- **User-Friendly Prompts:**
  - Provides default values for CSS selectors, output filename, file type, and format when not supplied by the user.
  - Prompts the user for confirmation before proceeding with the export.
- **Error Handling:**
  - Improved error handling with descriptive messages at each step (e.g., crawling, formatting, writing to file).
  - Detects invalid URLs, unsupported formats, and feed type issues early in the process.
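Early validation of this kind amounts to checking requested values against a whitelist before any crawling starts. A minimal sketch of the approach, assuming a hypothetical `validateFileType` helper (the real tool's messages and structure may differ):

```go
package main

import "fmt"

// supportedFileTypes mirrors the output formats listed above;
// the actual tool may keep this set elsewhere.
var supportedFileTypes = map[string]bool{
	"txt": true, "json": true, "jsonl": true, "md": true, "pdf": true,
}

// validateFileType fails early with a descriptive message instead of
// discovering an unsupported format after the crawl has finished.
func validateFileType(ft string) error {
	if !supportedFileTypes[ft] {
		return fmt.Errorf("unsupported file type %q (expected txt, json, jsonl, md, or pdf)", ft)
	}
	return nil
}

func main() {
	for _, ft := range []string{"jsonl", "docx"} {
		if err := validateFileType(ft); err != nil {
			fmt.Println(err)
		} else {
			fmt.Println(ft, "ok")
		}
	}
}
```

Failing before the crawl begins is what makes the descriptive messages useful: the user gets feedback in milliseconds rather than after fetching every page.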
🔧 Technical Details
- **Modular Project Structure:**
  - `crawler/`: Handles sitemap and RSS crawling and content extraction.
  - `formatter/`: Formats the extracted content into different file types.
  - `writer/`: Manages file output and writing content to disk.
  - `feed/`: Detects feed types (sitemap or RSS).
  - `main.go`: Entry point for the CLI, leveraging `cobra` for flag parsing and command execution.
- **Command-Line Interface:**
  - Built with `cobra`, providing flexibility for future expansion and additional commands.
- **Dependencies:**
  - `goquery` for parsing HTML.
  - `sanitize` for cleaning and sanitizing HTML content.
  - `gofpdf` for generating PDF output.
  - `html-to-markdown` for Markdown conversion.
Full Changelog: https://github.com/rlnorthcutt/sitemapExport/commits/0.0.9