Home
FilipBolt edited this page Sep 17, 2019 · 19 revisions
Welcome to TakeLab Podium, a Python machine learning library that helps users accelerate their work with natural language processing models.
This wiki is the main source of documentation for developers working with (or contributing to) the TakeLab Podium project.
Podium's goal is described in the following figure.
The data part of Podium starts with a Dataset definition, which is composed of Examples and Fields.
Every Field can have its own vocabulary, about which you can find more here.
Iteration over a dataset is defined by Iterators.
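To make the relationship between these concepts concrete, here is a minimal sketch in plain Python. It does not use Podium's API: `ToyVocab`, `ToyField`, `ToyDataset` and `batches` are hypothetical names invented for this illustration, and the real classes may differ in naming and behaviour.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyVocab:
    """Maps tokens to integer indices (hypothetical, not Podium's vocabulary class)."""
    stoi: Dict[str, int] = field(default_factory=dict)

    def add(self, token: str) -> int:
        # Assign the next free index the first time a token is seen.
        return self.stoi.setdefault(token, len(self.stoi))


@dataclass
class ToyField:
    """Describes one column of the data and owns its vocabulary."""
    name: str
    vocab: ToyVocab = field(default_factory=ToyVocab)

    def process(self, raw: str) -> List[int]:
        # Whitespace tokenization, then numericalization through the vocabulary.
        return [self.vocab.add(token) for token in raw.split()]


class ToyDataset:
    """Holds examples; each example maps a field name to its processed value."""

    def __init__(self, rows: List[Dict[str, str]], fields: List[ToyField]):
        self.fields = fields
        self.examples = [
            {f.name: f.process(row[f.name]) for f in fields} for row in rows
        ]

    def __len__(self) -> int:
        return len(self.examples)


def batches(dataset: ToyDataset, batch_size: int):
    """Yield consecutive slices of examples, playing the role of an iterator."""
    for start in range(0, len(dataset), batch_size):
        yield dataset.examples[start:start + batch_size]


rows = [{"text": "a good movie"}, {"text": "a bad movie"}, {"text": "good plot"}]
text_field = ToyField("text")
dataset = ToyDataset(rows, [text_field])
all_batches = list(batches(dataset, batch_size=2))
```

In this sketch the field, not the dataset, owns the vocabulary, which mirrors the statement that every Field can have its own vocabulary.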
Preprocessing utilities are defined as part of the preproc submodule. Here, one can find some typical natural language processing utilities:
- tokenizers -- divide a single string into a list of tokens,
- stop word lists -- words that are typically omitted when building NLP models,
- lemmatizers -- procedures that determine the canonical form (lemma) of a word,
- stemmers -- procedures that reduce a word to its stem (root form).
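As a rough illustration of what such utilities do, the following sketch implements a tokenizer, a stop word filter and a toy stemmer in plain Python. It is not the preproc submodule's API: the function names, the tiny stop word set and the suffix rules are all assumptions made for this example. A lemmatizer is left out because it normally needs a dictionary or a trained model.

```python
import re
from typing import List

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and"}

# Toy suffix rules, checked in order; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def toy_stem(token: str) -> str:
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = tokenize("The cats are chasing the painted toys.")
content = remove_stop_words(tokens)
stems = [toy_stem(token) for token in content]
print(stems)  # ['cat', 'chas', 'paint', 'toy']
```

The output `chas` for `chasing` shows why stems need not be valid words, which is the main practical difference from lemmatization.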
Typically, preprocessing is defined as hooks, which are executed when data is loaded.
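The hook idea can be sketched as a list of callables that a field applies, in order, each time a raw value is loaded. The `HookedField` class and its `add_hook` and `load` methods below are hypothetical and exist only for this illustration; Podium's actual hook interface may look different.

```python
from typing import Callable, List

# A hook takes a list of tokens and returns a (possibly modified) list of tokens.
Hook = Callable[[List[str]], List[str]]


class HookedField:
    """Hypothetical field that runs registered hooks while loading data."""

    def __init__(self, name: str):
        self.name = name
        self._hooks: List[Hook] = []

    def add_hook(self, hook: Hook) -> None:
        # Hooks run in the order in which they were registered.
        self._hooks.append(hook)

    def load(self, raw: str) -> List[str]:
        tokens = raw.split()
        for hook in self._hooks:
            tokens = hook(tokens)
        return tokens


def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


def drop_short(tokens: List[str]) -> List[str]:
    # Keep only tokens longer than two characters.
    return [token for token in tokens if len(token) > 2]


text_field = HookedField("text")
text_field.add_hook(lowercase)
text_field.add_hook(drop_short)
loaded = text_field.load("An Example OF Hooked Loading")
print(loaded)  # ['example', 'hooked', 'loading']
```

Keeping hooks as plain callables means any tokenizer, stop word filter or stemmer can be plugged in without changing the loading code.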
- Large resource - if you need to use or write a class that downloads a large resource from a server to the takepod resources folder
- Logging - if you need good logging from Podium modules