This repo contains several chain velds that encapsulate the training and evaluation of static word embedding architectures on the Austrian Media Corpus (AMC).
Data is only persisted in private GitLab repos due to licensing and storage constraints. This public chain repo contains only pointers to public metadata repos.
Models are currently also only persisted internally while work is in progress. Once models achieve acceptable performance, they will be published to Hugging Face.
This is the list of code and data velds, integrated into chain velds within this repository:
- : fastText code
- : GloVe code
- : word2vec code
- : Evaluation code for all word embeddings.
- : Preprocessing code
- : AMC training data, in all its preprocessed combinations. The public branch only contains metadata.
- : trained fastText models. Due to storage issues, contains only metadata.
- : trained GloVe models. Due to storage issues, contains only metadata.
- : trained word2vec models. Due to storage issues, contains only metadata.
- : Aggregation of evaluation and analysis on performance of trained models.
- git
- docker compose (note: older Docker Compose versions require running docker-compose instead of docker compose)
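Compose v2 is invoked as a plugin of the docker CLI (docker compose), while older installations ship a standalone docker-compose executable. The following is a minimal sketch of how a wrapper script could detect which variant is available; the helper name compose_command is hypothetical and not part of this repo.

```python
import shutil
import subprocess


def compose_command():
    """Return the argv prefix for Docker Compose, or None if no variant is found."""
    docker = shutil.which("docker")
    if docker is not None:
        # Compose v2 is a docker CLI plugin: `docker compose ...`
        try:
            probe = subprocess.run(
                [docker, "compose", "version"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=30,
            )
        except (OSError, subprocess.TimeoutExpired):
            probe = None
        if probe is not None and probe.returncode == 0:
            return ["docker", "compose"]
    # Older installations provide the standalone `docker-compose` executable
    if shutil.which("docker-compose") is not None:
        return ["docker-compose"]
    return None


if __name__ == "__main__":
    print(compose_command())
```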
Clone this repo with all its submodules
git clone --recurse-submodules
Several chain velds are used for training models in varying configurations, so there is no single overarching workflow. However, most workflows roughly follow this pattern:
- preprocessing
- training
- evaluation
- analysis
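As a rough illustration of this pattern, the sketch below builds the docker compose commands for one possible pass through the four stages. The chain file names are taken from this README, but the particular selection (one chain per stage) and the helper names are hypothetical. Whether the chains can be run back to back without adjusting their yaml configuration is also an assumption.

```python
import subprocess

# Hypothetical selection: one chain per stage, using file names from this README.
PIPELINE = [
    "veld_preprocess_clean.yaml",                    # preprocessing
    "veld_train_fasttext.yaml",                      # training
    "veld_eval_fasttext.yaml",                       # evaluation
    "veld_analyse_evaluation_non_interactive.yaml",  # analysis
]


def build_commands(chain_files, compose=("docker", "compose")):
    """Return one `docker compose -f <file> up` argv list per chain file."""
    return [[*compose, "-f", chain_file, "up"] for chain_file in chain_files]


def run_pipeline(chain_files, dry_run=True):
    """Print each command in order; execute it only when dry_run is False."""
    for cmd in build_commands(chain_files):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence as soon as one chain fails
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(PIPELINE, dry_run=True)
```

With dry_run left at True, the script only prints the commands, which makes it safe to inspect the order before starting any long-running training.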
Each chain veld yaml represents one persisted workflow at one point in time. See the yaml files listed below for more details.
The entire AMC data is preprocessed in combinations of these preprocessing chains:
Removes lines that don't reach a threshold regarding the ratio of textual content to non-textual (numbers, special characters) content.
docker compose -f veld_preprocess_clean.yaml up
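To make the idea of a text-to-non-text ratio concrete, here is a minimal sketch of such a filter. It is not the veld's actual implementation: the ratio definition (alphabetic characters divided by all non-whitespace characters), the default threshold of 0.7, and the function names are assumptions chosen for illustration.

```python
def text_ratio(line):
    """Share of alphabetic characters among all non-whitespace characters."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    # str.isalpha() is Unicode-aware, so German umlauts count as text
    return sum(c.isalpha() for c in chars) / len(chars)


def clean_lines(lines, min_ratio=0.7):
    """Keep only lines whose text ratio reaches the (assumed) threshold."""
    return [line for line in lines if text_ratio(line) >= min_ratio]


if __name__ == "__main__":
    sample = ["Wien ist schön.", "12345 !!! 678", "Tel. 01 234 56 78"]
    print(clean_lines(sample))
```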
Makes entire text lowercase.
docker compose -f veld_preprocess_lowercase.yaml up
Removes punctuation from text with spaCy pretrained models.
docker compose -f veld_preprocess_remove_punctuation.yaml up
Takes a random sample of lines from a text file. A seed can be set to make the sample reproducible.
docker compose -f veld_preprocess_sample.yaml up
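The effect of a seed on sampling can be sketched as follows. This is an illustration using Python's standard library, not the veld's own code; the function name, its parameters, and the choice to keep sampled lines in their original order are assumptions.

```python
import random


def sample_lines(lines, k, seed=None):
    """Return k randomly chosen lines, kept in their original order."""
    lines = list(lines)
    # A dedicated Random instance keeps the seed local to this call
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(lines)), min(k, len(lines))))
    return [lines[i] for i in chosen]


if __name__ == "__main__":
    corpus = [f"line {i}" for i in range(1, 101)]
    # The same seed yields the same sample on every run
    print(sample_lines(corpus, 5, seed=42) == sample_lines(corpus, 5, seed=42))
```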
Removes all lines before and after given line numbers.
docker compose -f veld_preprocess_strip.yaml up
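Stripping by line numbers amounts to keeping a contiguous window of the file. The sketch below assumes 1-based, inclusive line numbers; the veld's actual convention and parameter names may differ, and strip_lines is a hypothetical helper.

```python
from itertools import islice


def strip_lines(lines, first, last):
    """Keep lines first..last (assumed 1-based, inclusive); drop everything else."""
    # islice works on any iterable, so a large file can be streamed
    return list(islice(lines, first - 1, last))


if __name__ == "__main__":
    print(strip_lines(["a", "b", "c", "d", "e"], 2, 4))
```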
The following chains encapsulate training workflows:
A fastText training setup.
docker compose -f veld_train_fasttext.yaml up
A GloVe training setup.
docker compose -f veld_train_glove.yaml up
A word2vec training setup.
docker compose -f veld_train_word2vec.yaml up
The following chains encapsulate evaluation workflows:
Custom evaluation logic on fastText word embeddings.
docker compose -f veld_eval_fasttext.yaml up
Custom evaluation logic on GloVe word embeddings.
docker compose -f veld_eval_glove.yaml up
Custom evaluation logic on word2vec word embeddings.
docker compose -f veld_eval_word2vec.yaml up
These chains condense the training and evaluation results into overviews and statistics:
Launches a Jupyter notebook with various analysis and visualization steps.
docker compose -f veld_analyse_evaluation.yaml up
Executes the Jupyter notebook code non-interactively, mainly to persist statistics and visualizations as versioned files.
docker compose -f veld_analyse_evaluation_non_interactive.yaml up