Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Updated Dec 11, 2024 - Python
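Text mining and preprocessing tools (text cleaning, new word discovery, sentiment analysis, entity recognition and linking, keyword extraction, knowledge extraction, syntactic parsing, etc.), using unsupervised or weakly supervised methods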
🧹 Python package for text cleaning
E2M converts various file types (doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, m4a) into Markdown. It’s easy to install, with dedicated parsers and converters, supporting custom configs. E2M offers an all-in-one, flexible, and open-source solution.
A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package
Tools for cleaning and normalizing text data
Grammarify is an npm package that safely cleans up text that has misspellings, improper capitalization, and lexical illusions, among other things.
NLP pre- and post-processing tools.
A Python toolkit for file processing, text cleaning, and data splitting.
Text preprocessing tools in Python.
Dataiku DSS plugin to detect languages, correct misspellings, and clean text data 🧼
Text preprocessing package that includes cleaning, tokenization, dataset preparation, etc.
Korean text data preprocessing toolkit for NLP
A Python package to get useful information from documents using the TopicRank algorithm.
Text preprocessing package for use in NLP tasks https://pypi.org/project/textcl/
JS / Python3 / PHP library for working with UTF-8 polytonic Greek and Latin
4th place (top 1%) solution for Shopee Code League 2020 - Product Detection
Common Text Pre-Processing for Portuguese
Remove extra whitespace from text.
Corpora and scripts for cleaning political science texts. Scripts are translated into transformations that support SAGE Texti.