Skip to content

Latest commit

 

History

History
56 lines (46 loc) · 1.92 KB

README.md

File metadata and controls

56 lines (46 loc) · 1.92 KB

jieba-hs

Haskell CI

TODO:

  • Serious performance issues, see Main.hs for benchmarking program, do some profiling.
  • Most likely due to personal ineptitude with lists and O(n^2) concat

In Beta!

Things that are missing: TF-IDF, Modification of the dictionary at runtime, and async. At least the things that are glaringly obvious. A lot more is missing.

「結巴-hs」是「結巴」中文分詞的Haskell版本。

"Jieba-hs" is an implementation of the "Jieba" word segmentation library for Chinese in Haskell.

使用 Usage

jieba-hs的字典格式與jieba的一模一樣 (原版)。 字典在data/*。HMM Model看hmm.model, 是從cppjieba借的。

The format of the dictionaries are the same as jieba, see the (original). The HMM Model is borrowed from cppjieba with a few slight modifications.

import System.IO
import Dictionary
import Jieba
import Data.List (intercalate)

main :: IO ()
main = do
    contents <- hGetContents =<< openFile "dict.txt.small" ReadMode
    let dict = dictFromContents contents
    let hmmd <- readHmmDict "data/hmm.model"
    let snt = "他来到了网易杭研大厦"
    let result = cutNoHMM dict snt
    let result' = cutHMM dict hmmd snt
    let result'' = cutAll dict snt
    putStrLn $ intercalate "/" result
    putStrLn $ intercalate "/" result'
    putStrLn $ intercalate "/" result''
*Main> main
他/来到/了/网易/杭/研/大厦 -- No HMM
他/来到/了/网易/杭研/大厦 -- With HMM
他/来/来到/到/了/网/网易/易/杭/研/大/大厦/厦 -- All possible cuts

TODO

  • TF-IDF
  • 使用Trie?
  • QuickCheck unit tests