html2md

A bash tool that extracts content from HTML using CSS selectors and converts it to Markdown. It can process a single HTML document from stdin or batch-process multiple URLs listed in a file.

Features

Extract specific content from HTML using CSS selectors
Convert HTML to well-formatted Markdown
Process HTML from stdin or from URLs
Batch process multiple URLs from a file
Preserve special characters and formatting

Requirements

curl - for downloading web pages
pup - for HTML parsing using CSS selectors
perl - for text processing
pandoc - for HTML to Markdown conversion
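
Before running the tool, it can help to confirm that all four commands are on your PATH. The snippet below is a minimal sketch, not part of html2md itself; the function name `missing_deps` is hypothetical.

```sh
# Print the name of every listed command that is not found on PATH.
# (missing_deps is a hypothetical helper, not part of html2md.)
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Check the four tools listed under Requirements.
missing=$(missing_deps curl pup perl pandoc)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing dependencies:" $missing >&2
fi
```

If nothing is printed, all dependencies were found.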

Installation

Clone this repository:

```
git clone https://github.com/yourusername/html2md.git
```

Make the script executable:
```
chmod +x html2md
```
Optionally, copy it to a directory on your PATH:
```
sudo cp html2md /usr/local/bin/
```

Install dependencies (example for Debian/Ubuntu):

```
sudo apt install curl perl pandoc
# For pup, you may need to install from GitHub:
# https://github.com/ericchiang/pup
```

Usage

Basic Usage

```
cat page.html | html2md "div.content" > output.md
```

View Help

```
html2md
html2md -h
html2md --help
```

Process HTML from stdin

```
curl -s https://example.com | html2md "main"
```

Process a single URL

```
echo "https://example.com" > urls.txt
html2md --urls=urls.txt "article"
```
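
The invocations shown so far (no arguments, `-h`/`--help`, a bare selector, and `--urls=FILE` plus a selector) suggest a small argument parser. The following is only a sketch of how such parsing could look in shell; `parse_args` and the variables `MODE`, `URL_FILE`, and `SELECTOR` are hypothetical names, and the actual script may be organized differently.

```sh
# Hypothetical parser: sets MODE (help, stdin, or urls), URL_FILE and SELECTOR.
parse_args() {
    MODE=stdin
    URL_FILE=
    SELECTOR=
    # Assumption: running with no arguments shows the help text.
    if [ "$#" -eq 0 ]; then
        MODE=help
        return 0
    fi
    for arg in "$@"; do
        case "$arg" in
            -h|--help) MODE=help; return 0 ;;
            --urls=*)  MODE=urls; URL_FILE=${arg#--urls=} ;;
            *)         SELECTOR=$arg ;;
        esac
    done
}

parse_args --urls=urls.txt "article"
echo "mode=$MODE file=$URL_FILE selector=$SELECTOR"
```

The `${arg#--urls=}` expansion strips the option prefix and leaves only the file name.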

Process multiple URLs

Create a file with URLs, one per line:

```
https://example.com
https://example.org
# This line is a comment and will be skipped
https://example.net
```

Then process them all:

```
html2md --urls=urls.txt "div.content"
```
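
To illustrate how a URL file with comment lines might be read, here is a minimal sketch. It assumes that blank lines are skipped as well and that surrounding whitespace is trimmed; neither is confirmed for html2md. The function name `list_urls` is hypothetical.

```sh
# Print each usable URL from the file named by $1, skipping blank lines
# and lines whose first non-space character is '#'.
# (list_urls is a hypothetical helper, not part of html2md.)
list_urls() {
    # The "|| [ -n ... ]" part also handles a last line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Trim leading and trailing whitespace.
        line=$(printf '%s' "$line" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
        case "$line" in
            ''|'#'*) continue ;;
        esac
        printf '%s\n' "$line"
    done < "$1"
}

# Possible use, fetching each page and piping it through the stdin mode:
#   list_urls urls.txt | while IFS= read -r url; do
#       curl -s "$url" | html2md "div.content"
#   done
```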

CSS Selector Syntax

The tool uses pup for CSS selection. Some examples:

article - Select all article elements
div.content - Select div elements with class "content"
#main-content - Select element with id "main-content"
article p - Select paragraphs inside article elements

For more details, visit: https://github.com/ericchiang/pup

Markdown Output Format

The output is formatted using Pandoc with GitHub Flavored Markdown (GFM) plus the following extensions:

bracketed_spans
definition_lists
fancy_lists
implicit_figures
smart (for proper quotes and dashes)
subscript
superscript

For more information about Pandoc's Markdown extensions, see the Pandoc documentation.
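
Pandoc enables extensions by appending `+name` to the output format. The sketch below builds such a format string from the list above and wraps a possible pup-then-pandoc pipeline in a function. The names `build_format`, `FORMAT`, and `html_to_md` are hypothetical, and whether a given Pandoc version accepts every one of these extensions on `gfm` is not confirmed here.

```sh
# Join a base format and extension names with '+', e.g. "gfm+smart".
build_format() {
    fmt=$1
    shift
    for ext in "$@"; do
        fmt="$fmt+$ext"
    done
    printf '%s\n' "$fmt"
}

FORMAT=$(build_format gfm bracketed_spans definition_lists fancy_lists \
    implicit_figures smart subscript superscript)

# Hypothetical conversion step: select elements with pup, then convert the
# resulting HTML with pandoc. Defined here but not called.
html_to_md() {
    pup "$1" | pandoc --from=html --to="$FORMAT"
}

echo "$FORMAT"
```

With pup and pandoc installed, `html_to_md "article" < page.html` would then emit Markdown for the selected elements.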

Examples

Extract the main content from a news article:

```
curl -s https://news-site.com/article/12345 | html2md "article.main-content" > article.md
```

Extract titles from multiple pages:

```
html2md --urls=blog-posts.txt "h1.title"
```

License

MIT License
