- corpus_prep.py -- Sets up the initial data preparation and defines the BMES tagging logic.
- Matrix_generator.py -- Generates the transition matrix, emission matrix, and head (initial-state) matrix from the BMES-tagged corpus (the head matrix is not used yet).
- HMM_Model.py -- Mines new terms from the test corpus with the Viterbi algorithm, using the matrices above. Users can run it sentence by sentence to fit downstream NLP tasks.
- word_filter.py -- Handles word-filtering steps, such as resolving overlaps between short and long terms and discarding low-document-frequency candidates.
- main.py -- Mines new terms from an entire corpus document, so the model digs out all the underlying candidate new terms in one large pass.
- odd_handle.py -- Used to inspect suspicious new terms: how each word is positioned in all related sentences and what the attributes of the term's components are. Users can then add hand-crafted new terms so the model's corpus knowledge becomes more complete. (This can be regarded as a remedial step.)
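The Viterbi decoding step at the heart of HMM_Model.py can be sketched as follows. This is a minimal, hypothetical reimplementation: the dictionary-based matrix layout, the log-probability convention, and the unseen-character fallback are assumptions, not the project's actual data structures.

```python
# B = begin, M = middle, E = end of a multi-character term; S = single character.
STATES = ["B", "M", "E", "S"]

def viterbi(obs, start_p, trans_p, emit_p):
    """Decode the most likely BMES state sequence for a character sequence.

    obs: list of characters.
    start_p: dict state -> log-probability of starting in that state.
    trans_p: nested dict trans_p[prev][cur] of transition log-probabilities.
    emit_p: nested dict emit_p[state][char] of emission log-probabilities.
    (Layouts are assumed; -1e9 stands in for log(0) on unseen characters.)
    """
    # Initialize with the first character's scores.
    V = [{s: start_p[s] + emit_p[s].get(obs[0], -1e9) for s in STATES}]
    path = {s: [s] for s in STATES}
    for ch in obs[1:]:
        V.append({})
        new_path = {}
        for s in STATES:
            # Pick the best previous state for each current state.
            prob, prev = max(
                (V[-2][p] + trans_p[p][s] + emit_p[s].get(ch, -1e9), p)
                for p in STATES
            )
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(STATES, key=lambda s: V[-1][s])
    return path[best]
```

Characters tagged B...E or S are then joined into candidate terms; candidates absent from the existing dictionary are the mined "new terms".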
- old: The domain terms we already have, used to sharpen the jieba tokenizer (path: ./data)
- new: The new corpus file to mine new terms from; originally in pd.DataFrame format (path: ./data)
- corp_col: The column name of the corpus in the pd.DataFrame file
- corp_file: The file name of the corpus cleaned in step 1.
- keep_stopwords: boolean; tells the model whether to keep stopwords. If you change this parameter, rerun the whole workflow from step 1 to step 2.
- threshold: int; the threshold for the overlap-word filter
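The overlap filter controlled by `threshold` might work along these lines. This is a hypothetical sketch: the function name `filter_overlaps` and the exact semantics (drop a short candidate once it appears inside at least `threshold` longer candidates) are assumptions about word_filter.py, not its actual code.

```python
def filter_overlaps(terms, threshold):
    """Drop a short candidate term when it occurs as a substring of
    at least `threshold` longer candidates (assumed semantics).

    terms: list of candidate term strings.
    threshold: int, the overlap count at which a term is discarded.
    """
    keep = []
    for t in terms:
        # Count how many other candidates contain this term.
        overlap = sum(1 for other in terms if other != t and t in other)
        if overlap < threshold:
            keep.append(t)
    return keep
```

For example, with `threshold=2`, a two-character fragment contained in two longer candidates would be discarded while the longer candidates survive.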