Clean HTML Pipeline

This repository contains a script that cleans HTML files and extracts metadata from them. The script performs the following tasks:

Features

Extract metadata, sections, links, code blocks, and images from HTML files.
Skip files that are too large and log any errors.
Log each skipped file so it can be processed later.
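
The features above can be sketched with the Python standard library alone. This is a minimal illustration, not the code in main.py: the names PageExtractor, extract_html, process_file, and MAX_BYTES are hypothetical, the 5 MiB limit is an assumed value, and treating headings as "sections" is a simplification.

```python
import logging
from html.parser import HTMLParser
from pathlib import Path

# Hypothetical size limit; the threshold main.py actually uses is not documented here.
MAX_BYTES = 5 * 1024 * 1024
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

logger = logging.getLogger("clean_html_sketch")


class PageExtractor(HTMLParser):
    """Collect title, meta tags, headings, links, images, and <pre> blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}
        self.sections = []
        self.links = []
        self.images = []
        self.code_blocks = []
        self._in_title = False
        self._heading = None  # text buffer while inside a heading
        self._pre = None      # text buffer while inside <pre>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag in HEADING_TAGS:
            self._heading = []
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "pre":
            self._pre = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in HEADING_TAGS and self._heading is not None:
            self.sections.append("".join(self._heading).strip())
            self._heading = None
        elif tag == "pre" and self._pre is not None:
            self.code_blocks.append("".join(self._pre))
            self._pre = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._heading is not None:
            self._heading.append(data)
        if self._pre is not None:
            self._pre.append(data)


def extract_html(html):
    """Parse an HTML string and return the collected fields as a dict."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "meta": parser.meta,
        "sections": parser.sections,
        "links": parser.links,
        "images": parser.images,
        "code_blocks": parser.code_blocks,
    }


def process_file(path, max_bytes=MAX_BYTES):
    """Return extracted data, or None if the file is skipped or unreadable."""
    path = Path(path)
    try:
        if path.stat().st_size > max_bytes:
            logger.warning("Skipping large file: %s", path)
            return None
        return extract_html(path.read_text(encoding="utf-8", errors="replace"))
    except OSError as exc:
        logger.error("Failed to process %s: %s", path, exc)
        return None
```

Returning None for both skipped and failed files keeps the caller simple. The log messages distinguish the two cases, so skipped files can be found and reprocessed later.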

Usage

Running the Script

To run the script, use the following command:

python3 main.py

Directory Structure

main.py: The main script to process HTML files.
venv/: The virtual environment directory.

Tools and Commands

Delete All Files Except main.py and venv Directory

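This command deletes every file and directory under the current directory except main.py and the venv directory. The deletion is irreversible and also removes hidden entries such as .git if present, so consider replacing -delete with -print first to preview what would be removed. The regular expression is written for GNU find: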

find . -mindepth 1 ! -regex './main.py\|./venv\(/.*\)?' -delete
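
Because the command cannot be undone, a dry run may be useful. The sketch below lists the top-level entries that the command would remove, without deleting anything. The function name entries_to_delete and the KEEP set are hypothetical and only mirror the two exclusions in the command above.

```python
from pathlib import Path

# Entries the find command leaves in place.
KEEP = {"main.py", "venv"}


def entries_to_delete(root=".", keep=KEEP):
    """Return the sorted names of top-level entries that are not in `keep`."""
    return sorted(p.name for p in Path(root).iterdir() if p.name not in keep)
```

Calling entries_to_delete() from the repository root returns the names to review before running the real command.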

Watch and Tail Log File

This command redisplays the last 20 lines of the app.log file every second:

watch -n 1 tail -n 20 app.log
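
Where watch is not available, the same "last 20 lines" view can be approximated in Python. This is a small sketch and not part of the repository; tail_lines is a hypothetical helper name.

```python
from collections import deque


def tail_lines(path, n=20):
    """Return the last `n` lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        # A bounded deque keeps only the most recent n lines while reading.
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]
```

Calling tail_lines("app.log") in a loop with a one-second sleep gives a rough equivalent of the command above.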

Multi Clip

This alias/script gathers the contents of every file in the current directory into a single file and copies the result to the clipboard:

multi_clip -i ignore_list.txt .
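
The implementation of multi_clip is not included in this README. The gathering step might look roughly like the sketch below. It assumes that ignore_list.txt holds one glob pattern per line and that a pattern matches any path component; both are guesses. The names load_ignore_patterns and gather_files are hypothetical, and the clipboard step is left out because it depends on the platform.

```python
from fnmatch import fnmatch
from pathlib import Path


def load_ignore_patterns(ignore_file):
    """Read one pattern per line, skipping blanks and '#' comments (assumed format)."""
    lines = Path(ignore_file).read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.strip().startswith("#")]


def gather_files(root, patterns=()):
    """Concatenate all readable text files under `root`, each preceded by a header."""
    root = Path(root)
    chunks = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Skip the file if any component of its relative path matches a pattern.
        if any(fnmatch(part, pat) for part in rel.parts for pat in patterns):
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        chunks.append(f"===== {rel.as_posix()} =====\n{text}")
    return "\n".join(chunks)
```

The returned string can then be written to a file or piped to a clipboard tool.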
