
A bash script that extracts content from HTML using CSS selectors and converts it to Markdown, supporting both processing of individual HTML documents via stdin and batch processing of multiple URLs from a file.


Kntnt/html2md

html2md

html2md is a bash tool that extracts content from HTML using CSS selectors and converts it to Markdown. It can process a single HTML document from stdin or batch-process multiple URLs listed in a file.

Features

  • Extract specific content from HTML using CSS selectors
  • Convert HTML to well-formatted Markdown
  • Process HTML from stdin or from URLs
  • Batch process multiple URLs from a file
  • Preserve special characters and formatting

Requirements

  • curl - for downloading web pages
  • pup - for HTML parsing using CSS selectors
  • perl - for text processing
  • pandoc - for HTML to Markdown conversion
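
Before running the script, it can help to confirm that all four tools are on your PATH. The following is a minimal sketch; the function name missing_deps is hypothetical and is not part of html2md.

```shell
# Hypothetical helper (not part of html2md): print each named command
# that cannot be found on the PATH, one per line.
missing_deps() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Check the four tools listed above and report any that are absent.
missing=$(missing_deps curl pup perl pandoc)
if [ -n "$missing" ]; then
  printf 'Missing dependencies:\n%s\n' "$missing" >&2
fi
```

If nothing is printed, all four commands were found.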

Installation

  1. Clone this repository:

    git clone https://github.com/Kntnt/html2md.git
    cd html2md
  2. Make the script executable:

    chmod +x html2md
  3. Optionally, copy it to a directory on your PATH:

    sudo cp html2md /usr/local/bin/
  4. Install dependencies (example for Debian/Ubuntu):

    sudo apt install curl perl pandoc
    # For pup, you may need to install from GitHub:
    # https://github.com/ericchiang/pup

Usage

Basic Usage

cat page.html | html2md "div.content" > output.md

View Help

Any one of the following prints the help text:

html2md
html2md -h
html2md --help

Process HTML from stdin

curl -s https://example.com | html2md "main"
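
For orientation, the stdin mode is roughly comparable to piping HTML through pup and pandoc yourself. The sketch below is an assumption about the general shape of such a pipeline, not html2md's actual source; the function name html_to_md is hypothetical, and it leaves out the perl processing step and the Pandoc extensions the script uses.

```shell
# Rough stand-in for the stdin mode (an assumption, not html2md's source):
# select elements with pup, then convert the selected HTML with pandoc.
html_to_md() {
  pup "$1" | pandoc --from=html --to=gfm
}

# Usage, with the same input as above:
#   curl -s https://example.com | html_to_md "main"
```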

Process a single URL

echo "https://example.com" > urls.txt
html2md --urls=urls.txt "article"

Process multiple URLs

Create a file with URLs, one per line:

https://example.com
https://example.org
# This line is a comment and will be skipped
https://example.net

Then process them all:

html2md --urls=urls.txt "div.content"
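
To preview which lines of a URL file would be treated as URLs, a filter along these lines can be used. This is a sketch under assumptions: read_urls and process_urls are hypothetical names, and whether html2md itself also skips blank lines or indented comment lines is not stated in this README.

```shell
# Hypothetical helper: print the usable URLs from a list file, skipping
# blank lines and lines whose first non-blank character is '#'.
read_urls() {
  grep -v -E '^[[:space:]]*(#|$)' "$1"
}

# Fetch each remaining URL and convert it, one at a time, using the
# documented stdin mode (assumes html2md is on the PATH).
process_urls() {
  read_urls "$1" | while IFS= read -r url; do
    curl -s "$url" | html2md "$2"
  done
}
```

Running read_urls urls.txt on the file above would list the three URLs without the comment line.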

CSS Selector Syntax

The tool uses pup for CSS selection. Some examples:

  • article - Select all article elements
  • div.content - Select div elements with class "content"
  • #main-content - Select the element with id "main-content"
  • article p - Select paragraphs inside article elements

For more details, visit: https://github.com/ericchiang/pup

Markdown Output Format

The output is formatted using Pandoc with GitHub Flavored Markdown (GFM) plus the following extensions:

  • bracketed_spans
  • definition_lists
  • fancy_lists
  • implicit_figures
  • smart (for proper quotes and dashes)
  • subscript
  • superscript

For more information about Pandoc's Markdown extensions, see the Pandoc documentation.
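
Pandoc enables extensions by appending +name to the format name. The sketch below builds such a string from the list above; that html2md passes exactly this string is an assumption, pandoc_format is a hypothetical name, and some Pandoc versions may not accept every extension for gfm.

```shell
# Build a pandoc output-format string: GFM plus the extensions listed above.
# Whether html2md passes exactly this string is an assumption.
pandoc_format() {
  fmt=gfm
  for ext in bracketed_spans definition_lists fancy_lists implicit_figures \
             smart subscript superscript; do
    fmt="$fmt+$ext"
  done
  printf '%s\n' "$fmt"
}

# Example use with a local file:
#   pandoc --from=html --to="$(pandoc_format)" page.html
```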

Examples

Extract the main content from a news article:

curl -s https://news-site.com/article/12345 | html2md "article.main-content" > article.md

Extract titles from multiple pages:

html2md --urls=blog-posts.txt "h1.title"
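
To save each page to its own file instead, one option is to loop over the list and call the stdin mode once per URL. This is a sketch, not a feature of html2md; url_to_filename and save_each are hypothetical helpers.

```shell
# Hypothetical helper: turn a URL into a safe file name by dropping the
# scheme, replacing unsafe characters with '_', and appending '.md'.
url_to_filename() {
  printf '%s\n' "$1" |
    sed -e 's|^[a-zA-Z]*://||' -e 's|[^A-Za-z0-9._-]|_|g' -e 's|$|.md|'
}

# Convert every URL in a list file ($1) with a selector ($2),
# writing one Markdown file per URL into the current directory.
save_each() {
  while IFS= read -r url; do
    case $url in ''|'#'*) continue ;; esac   # skip blanks and comments
    curl -s "$url" | html2md "$2" > "$(url_to_filename "$url")"
  done < "$1"
}

# Usage: save_each blog-posts.txt "h1.title"
```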

License

MIT License
