open-gram is a project that collects a lexicon and builds an n-gram dataset for Chinese NLP. It tries to leverage existing open source resources such as crfpp and CC-CEDICT.
- open-gram consists of 4 parts:
  - corpus collection
  - segmentation
  - (new) word extraction
  - n-gram info counting
- crawl Chinese websites using Scrapy and grab the HTML body of each page
- preprocess the pages:
  - detect the encoding
  - remove HTML tags and other content we are not interested in
  - split the text into sentences
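
The preprocessing steps above could look roughly like the following. This is a minimal sketch using only the Python standard library, not the project's actual code: the names `decode_page`, `strip_html`, and `split_sentences` are hypothetical, the encoding step simply tries a short list of assumed candidate encodings in order (a dedicated detector would likely be more robust), and sentences are split on a small assumed set of Chinese end-of-sentence punctuation.

```python
import re
from html.parser import HTMLParser

# Assumed candidate encodings, tried in order (not taken from the project).
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")


def decode_page(raw: bytes) -> str:
    """Decode raw bytes with the first candidate encoding that succeeds."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the content of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_html(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def split_sentences(text: str) -> list:
    """Split text at newlines and after the punctuation marks 。！？."""
    sentences = re.findall(r"[^。！？\n]+[。！？]?", text)
    return [s.strip() for s in sentences if s.strip()]


if __name__ == "__main__":
    page = "<html><body><p>你好。今天天气很好！</p></body></html>".encode("gb18030")
    for sentence in split_sentences(strip_html(decode_page(page))):
        print(sentence)
```

Trial decoding is only a heuristic: a permissive encoding such as GB18030 will accept many byte sequences that were really written in another encoding, so the order of the candidates matters.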
- there are two ways to segment tokens into words:
  - tagging
  - matching
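
To illustrate the matching approach, here is a minimal sketch of forward maximum matching, one common dictionary-based segmentation method. The source does not say which matching algorithm open-gram uses, so this choice is an assumption, as are the function name `forward_max_match`, the `max_word_len` parameter, and the toy lexicon (in practice the word list might be derived from a resource such as CC-CEDICT).

```python
def forward_max_match(sentence, lexicon, max_word_len=4):
    """Greedily segment `sentence` using the longest lexicon match at each position.

    Characters that start no lexicon word are emitted as single-character words.
    """
    words = []
    i = 0
    while i < len(sentence):
        longest = min(max_word_len, len(sentence) - i)
        # Try the longest candidate first, shrinking until the lexicon has it.
        for length in range(longest, 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


if __name__ == "__main__":
    # Toy lexicon for illustration only.
    toy_lexicon = {"我们", "喜欢", "自然", "语言", "自然语言", "处理"}
    print(forward_max_match("我们喜欢自然语言处理", toy_lexicon))
```

Greedy matching is simple and fast, but it cannot resolve ambiguous segmentations and treats every out-of-lexicon character as its own word, which is one reason a tagging approach is often considered alongside it.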