A Bash tool that extracts content from HTML using CSS selectors and converts it to Markdown. It can process a single HTML document from stdin or batch-process multiple URLs listed in a file.
- Extract specific content from HTML using CSS selectors
- Convert HTML to well-formatted Markdown
- Process HTML from stdin or from URLs
- Batch process multiple URLs from a file
- Preserve special characters and formatting
- curl - for downloading web pages
- pup - for HTML parsing using CSS selectors
- perl - for text processing
- pandoc - for HTML to Markdown conversion
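Before running the tool, it can help to confirm that all four dependencies are on your PATH. The following is a minimal sketch, not part of html2md itself; the helper name missing_deps is hypothetical.

```shell
# missing_deps: hypothetical helper, not shipped with html2md.
# Prints the names of the given tools that are NOT found on PATH,
# separated by spaces. Prints an empty line if all are present.
missing_deps() (
  missing=""
  for tool in "$@"; do
    # command -v succeeds only if the tool can be resolved
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  # Strip the single leading space left by the accumulation above
  printf '%s\n' "${missing# }"
)

# Usage:
#   missing_deps curl pup perl pandoc
```

If the output is non-empty, install the listed tools before using html2md.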
- Clone this repository:
  git clone https://github.com/yourusername/html2md.git
- Make the script executable:
  chmod +x html2md
- Optionally, copy it to a directory on your PATH:
  sudo cp html2md /usr/local/bin/
- Install dependencies (example for Debian/Ubuntu):
  sudo apt install curl perl pandoc
  # For pup, you may need to install from GitHub:
  # https://github.com/ericchiang/pup
cat page.html | html2md "div.content" > output.md
html2md
html2md -h
html2md --help
curl -s https://example.com | html2md "main"
echo "https://example.com" > urls.txt
html2md --urls=urls.txt "article"
Create a file with URLs, one per line:
https://example.com
https://example.org
# This line is a comment and will be skipped
https://example.net
Then process them all:
html2md --urls=urls.txt "div.content"
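To illustrate how a URL file like the one above could be read, here is a rough sketch that skips comment lines. It is not the tool's actual implementation: the helper name list_urls is hypothetical, and skipping blank lines is an assumption that the text above does not state.

```shell
# list_urls: hypothetical helper, not shipped with html2md.
# Prints each URL from the file given as $1, skipping lines that start
# with '#' (comments) and, as an assumption, empty lines.
list_urls() (
  # The "|| [ -n "$line" ]" part keeps a final line that lacks a newline
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ''|'#'*) continue ;;
    esac
    printf '%s\n' "$line"
  done < "$1"
)

# Usage (requires network access plus the dependencies listed earlier):
#   list_urls urls.txt | while IFS= read -r url; do
#     curl -s "$url" | html2md "div.content"
#   done
```

The manual loop in the usage comment is roughly what the --urls option saves you from writing.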
The tool uses pup for CSS selection. Some examples:
- article - Select all article elements
- div.content - Select div elements with class "content"
- #main-content - Select element with id "main-content"
- article p - Select paragraphs inside article elements
For more details, visit: https://github.com/ericchiang/pup
The output is formatted using Pandoc with GitHub Flavored Markdown (GFM) plus the following extensions:
- bracketed_spans
- definition_lists
- fancy_lists
- implicit_figures
- smart (for proper quotes and dashes)
- subscript
- superscript
For more information about Pandoc's Markdown extensions, see the Pandoc documentation.
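For readers who want to reproduce a similar conversion by hand, the extension list above can be assembled into a Pandoc output format string. This is only a sketch: the helper name pandoc_format is hypothetical, the exact invocation html2md uses is not shown here, and whether Pandoc accepts every one of these extensions for gfm may depend on the installed version.

```shell
# pandoc_format: hypothetical helper, not shipped with html2md.
# Builds a Pandoc format string of the form gfm+ext1+ext2+...
pandoc_format() (
  fmt="gfm"
  for ext in bracketed_spans definition_lists fancy_lists \
             implicit_figures smart subscript superscript; do
    fmt="$fmt+$ext"
  done
  printf '%s\n' "$fmt"
)

# Usage (assumes pandoc is installed; the real html2md command may differ):
#   pandoc --from=html --to="$(pandoc_format)" < fragment.html
```

Running `pandoc --list-extensions=gfm` shows which extensions your Pandoc version supports for this format.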
Extract the main content from a news article:
curl -s https://news-site.com/article/12345 | html2md "article.main-content" > article.md
Extract titles from multiple pages:
html2md --urls=blog-posts.txt "h1.title"