This repository contains a script to clean and extract metadata from HTML files. The script performs the following tasks:
- Extract metadata, sections, links, code blocks, and images from HTML files.
- Skip large files and log errors.
- Handle and log skipped files for further processing.
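The tasks above can be sketched with the Python standard library alone. This is a minimal illustration, not the code in main.py: the names `PageExtractor`, `extract`, `CAPTURED`, and `MAX_BYTES`, the 5 MB limit, and the choice of `h1` to `h3` headings as "sections" are all assumptions made for the example.

```python
import logging
import os
from html.parser import HTMLParser

MAX_BYTES = 5 * 1024 * 1024  # hypothetical limit; the real threshold is not documented
CAPTURED = {"title", "pre", "h1", "h2", "h3"}  # tags whose text is collected (assumption)


class PageExtractor(HTMLParser):
    """Collects the title, meta tags, headings, links, code blocks and image sources."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.sections = []
        self.links = []
        self.images = []
        self.code_blocks = []
        self._capture = None  # tag whose text is currently being buffered
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"]] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag in CAPTURED and self._capture is None:
            self._capture = tag
            self._buffer = []

    def handle_data(self, data):
        if self._capture is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = "".join(self._buffer)
        if tag == "title":
            self.title = text.strip()
        elif tag == "pre":
            self.code_blocks.append(text)
        else:
            self.sections.append(text.strip())
        self._capture = None


def extract(path, max_bytes=MAX_BYTES):
    """Return the extracted fields as a dict, or None if the file was skipped."""
    try:
        if os.path.getsize(path) > max_bytes:
            logging.warning("Skipped %s: larger than %d bytes", path, max_bytes)
            return None
        with open(path, encoding="utf-8", errors="replace") as fh:
            parser = PageExtractor()
            parser.feed(fh.read())
            parser.close()
    except OSError as exc:
        logging.error("Failed to read %s: %s", path, exc)
        return None
    return {
        "title": parser.title,
        "meta": parser.meta,
        "sections": parser.sections,
        "links": parser.links,
        "images": parser.images,
        "code_blocks": parser.code_blocks,
    }
```

Returning `None` for both oversized and unreadable files keeps the caller simple: it can collect the paths that came back empty and retry them later, while the log records why each one was skipped.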
To run the script, use the following command:
python3 main.py
- main.py: The main script to process HTML files.
- venv/: The virtual environment directory.
- Delete All Files Except main.py and the venv Directory
This command deletes all files and directories except for main.py and the venv directory (including its contents):
find . -mindepth 1 ! -regex '\./main\.py\|\./venv\(/.*\)?' -delete
- Watch and Tail Log File
This command refreshes every second and displays the last 20 lines of the app.log file:
watch -n 1 tail -n 20 app.log
- Multi Clip
This alias/script gathers the contents of all files in the current directory into a single file and copies the result to the clipboard:
multi_clip -i ignore_list.txt .
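The gathering step can be approximated in Python as follows. This is a hedged sketch and not the actual `multi_clip` implementation, which is not shown here: the function names, the `=====` header format, and the assumption that `ignore_list.txt` holds one glob pattern per line are all hypothetical. The clipboard step is left out because it depends on the platform.

```python
import fnmatch
import os


def load_ignore_patterns(path):
    """Read one glob pattern per line, skipping blanks and '#' comments (assumed format)."""
    with open(path, encoding="utf-8") as fh:
        lines = (line.strip() for line in fh)
        return [line for line in lines if line and not line.startswith("#")]


def is_ignored(rel_path, patterns):
    """True if the relative path, or any single component of it, matches a pattern."""
    parts = rel_path.split(os.sep)
    return any(
        fnmatch.fnmatch(rel_path, pat) or any(fnmatch.fnmatch(p, pat) for p in parts)
        for pat in patterns
    )


def gather(root=".", patterns=()):
    """Concatenate every readable text file under root, each preceded by a header line."""
    chunks = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Sort for a stable order and prune ignored directories so they are never entered.
        dirnames[:] = sorted(
            d for d in dirnames
            if not is_ignored(os.path.relpath(os.path.join(dirpath, d), root), patterns)
        )
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if is_ignored(rel, patterns):
                continue
            try:
                with open(full, encoding="utf-8") as fh:
                    body = fh.read()
            except (OSError, UnicodeDecodeError):
                continue  # skip unreadable or binary files
            chunks.append(f"===== {rel} =====\n{body}")
    return "\n".join(chunks)
```

Calling `gather(".", load_ignore_patterns("ignore_list.txt"))` returns one string that could then be written to a file or piped to a clipboard tool such as `pbcopy` on macOS or `xclip` on Linux.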