GitHub - XueWei/open-gram: collect lexicon and build n-gram dataset for NLP in Chinese

open-gram

open-gram is a project that collects a lexicon and builds an n-gram dataset for Chinese NLP. It tries to leverage existing open source resources such as crfpp and CC-CEDICT.

open-gram includes four parts:

corpus collection
segmentation
(new) word extraction
n-gram info counting
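
As a rough illustration of the last part, n-gram counting over sentences that have already been segmented might look like the sketch below. The function name `count_ngrams` and the input shape (a list of word lists) are assumptions made for this example, not the project's actual interface.

```python
from collections import Counter


def count_ngrams(sentences, n=2):
    """Count word n-grams over already segmented sentences.

    `sentences` is an iterable of word lists (hypothetical input shape).
    """
    counts = Counter()
    for words in sentences:
        # Slide a window of n words across the sentence.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    corpus = [["我", "爱", "北京"], ["我", "爱", "上海"]]
    for gram, freq in count_ngrams(corpus, n=2).most_common():
        print(" ".join(gram), freq)
```

Keying the counter on tuples keeps unigrams, bigrams, and higher orders in the same shape, so the same function covers every `n`.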

corpus collection

crawl Chinese web sites using scrapy and grab the HTML body of each page
preprocess the pages
  - detect the encoding
  - remove HTML tags and other content we are not interested in
  - split the text into sentences
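
The preprocessing steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's code: the candidate encoding list, the helper names (`decode_page`, `strip_html`, `split_sentences`), and the choice of sentence-ending punctuation are all assumptions. A real pipeline would likely use a dedicated encoding detector instead of trial decoding.

```python
import re
from html.parser import HTMLParser

# Assumed candidate encodings, tried in order (not taken from the project).
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")


def decode_page(raw):
    """Decode raw bytes by trying each candidate encoding in turn."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what we can and replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_html(html):
    """Return the visible text of an HTML page, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# A sentence is a run of non-terminator characters plus any trailing
# Chinese or ASCII terminators; newlines also end a sentence.
SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?]*")


def split_sentences(text):
    """Split text into sentences, dropping whitespace-only pieces."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


if __name__ == "__main__":
    raw = "<html><body><p>你好。今天天气好吗？</p></body></html>".encode("gb18030")
    print(split_sentences(strip_html(decode_page(raw))))
```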

segmentation

There are two ways to segment tokens into words:

* tagging
* matching
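
To make the two approaches concrete, here is a small sketch. For matching it uses forward maximum matching against a toy lexicon, which is one common matching strategy and not necessarily the one this project uses. For tagging it shows only the data representation a character tagger such as a CRF model typically works with, assuming a B/M/E/S label scheme. The function names, the label scheme, and the toy lexicon are all assumptions made for this example.

```python
def forward_max_match(sentence, lexicon, max_len=4):
    """Matching: greedily take the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def to_bmes(words):
    """Tagging: label each character as Begin, Middle, End, or Single."""
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


def from_bmes(pairs):
    """Rebuild words from (character, label) pairs, e.g. a tagger's output."""
    words, current = [], ""
    for ch, label in pairs:
        current += ch
        if label in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


if __name__ == "__main__":
    toy_lexicon = {"研究", "研究生", "生命", "起源"}  # hypothetical entries
    words = forward_max_match("研究生命起源", toy_lexicon)
    print(words)
    print(to_bmes(words))
```

On this input the greedy matcher picks 研究生 first and leaves 命 stranded, a well-known weakness of pure matching. A tagger trained on segmented text decides word boundaries from context rather than from the lexicon alone.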

Folders and files

build
data
lexicon
segment/tagging
tools/CRF++-0.53
.gitignore
README.rst
