ccprocessor/llm-webkit-mirror

Changelog

  • 2024/11/25: Project Initialization

Table of Contents

  1. llm-web-kit
  2. TODO
  3. Known Issues
  4. FAQ
  5. All Thanks To Our Contributors
  6. License Information
  7. Acknowledgments
  8. Citation
  9. Star History
  10. Links

llm-web-kit

Project Introduction

llm-web-kit is a Python library that ..

Key Features

  • Remove headers, footers, footnotes, page numbers, etc., to ensure semantic coherence.
  • Output text in human-readable order, suitable for single-column, multi-column, and complex layouts.

Quick Start

Extract with magic_html + recognize

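from typing import Optional

from loguru import logger

from llm_web_kit.simple import extract_html_to_md, extract_html_to_mm_md


def extract(url: str, html: str) -> Optional[str]:
    """Return Markdown for the page, or None if extraction fails."""
    try:
        # Default path: main-content extraction (magic_html) followed by recognize.
        nlp_md = extract_html_to_md(url, html)
        # or mm_nlp_md = extract_html_to_mm_md(url, html)
        return nlp_md
    except Exception as e:
        # Log the full traceback and signal failure to the caller.
        logger.exception(e)
        return None


if __name__ == "__main__":
    url = ""   # page URL
    html = ""  # raw HTML of the page
    markdown = extract(url, html)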

Extract with recognize only

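from typing import Optional

from loguru import logger

from llm_web_kit.simple import extract_html_to_md, extract_html_to_mm_md


def extract(url: str, raw_html: str) -> Optional[str]:
    """Return Markdown for the page, or None if extraction fails."""
    try:
        # clip_html=False skips the magic_html clipping step; only recognize runs.
        nlp_md = extract_html_to_md(url, raw_html, clip_html=False)
        # or mm_nlp_md = extract_html_to_mm_md(url, raw_html, clip_html=False)
        return nlp_md
    except Exception as e:
        # Log the full traceback and signal failure to the caller.
        logger.exception(e)
        return None


if __name__ == "__main__":
    url = ""   # page URL
    html = ""  # raw HTML of the page
    markdown = extract(url, html)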

Extract only main_html with magic-html

from typing import Optional

from loguru import logger

from llm_web_kit.simple import extract_main_html_by_maigic_html


def extract(url: str, html: str) -> Optional[str]:
    """Return the main-content HTML of the page, or None if extraction fails."""
    try:
        main_html = extract_main_html_by_maigic_html(url, html)
        # or mm_main_html = extract_pure_html_to_mm_md(url, html)
        return main_html
    except Exception as e:
        # Log the full traceback and signal failure to the caller.
        logger.exception(e)
        return None


if __name__ == "__main__":
    url = ""   # page URL
    html = ""  # raw HTML of the page
    main_html = extract(url, html)

Pipeline

  1. HTML pre-deduplication
  2. Domain clustering
  3. Layout clustering
  4. Typical layout node selection
  5. HTML node selection by LLM
  6. HTML parsing, layout by layout
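
The first four pipeline steps can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not llm-web-kit's implementation: `layout_signature`, `group_pages`, and `pick_typical` are hypothetical names, exact-hash deduplication stands in for whatever pre-dedup the project actually uses, and hashing the opening-tag sequence is an assumed stand-in for real layout clustering.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse


class _TagCollector(HTMLParser):
    """Collects opening tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def layout_signature(html: str) -> str:
    """Hypothetical layout fingerprint: hash of the opening-tag sequence."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return hashlib.sha256(" ".join(collector.tags).encode("utf-8")).hexdigest()


def group_pages(pages):
    """Group (url, html) pairs as {domain: {layout signature: [url, ...]}}."""
    seen = set()
    clusters = defaultdict(lambda: defaultdict(list))
    for url, html in pages:
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # step 1: drop exact duplicate HTML (assumed dedup rule)
        seen.add(digest)
        domain = urlparse(url).netloc  # step 2: cluster by domain
        clusters[domain][layout_signature(html)].append(url)  # step 3
    return clusters


def pick_typical(clusters):
    """Step 4 (assumed rule): first URL seen represents each layout cluster."""
    return {
        (domain, signature): urls[0]
        for domain, layouts in clusters.items()
        for signature, urls in layouts.items()
    }


pages = [
    ("https://a.com/1", "<html><body><p>x</p></body></html>"),
    ("https://a.com/2", "<html><body><p>y</p></body></html>"),
    ("https://a.com/3", "<html><body><p>x</p></body></html>"),  # duplicate of /1
    ("https://b.org/1", "<div><span>z</span></div>"),
]
clusters = group_pages(pages)
typical = pick_typical(clusters)
```

In this sketch the representative page of each layout cluster would be the one handed to the LLM in step 5, and the nodes it selects would then be reused to parse every page sharing that layout in step 6.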

Usage

TODO

Known Issues

FAQ

All Thanks To Our Contributors

License Information

Acknowledgments

Citation

Star History

Links
