veld chain veld_chain__train_infer_wordembeddings_multiple_architectures__amc

This repo contains several chain velds encapsulating the training and evaluation of static word embedding architectures on the Austria Media Corpus.

Due to licensing and storage constraints, the data is persisted only in private GitLab repos. This public chain repo contains only pointers to public metadata repos.

The models are likewise persisted only internally while work is in progress. Once they achieve acceptable performance, they will be published to Hugging Face.

integrated code and data velds

This is the list of code and data velds, integrated into chain velds within this repository:

requirements

  • git
  • docker compose (note: older docker compose versions require running docker-compose instead of docker compose)

Clone this repo with all its submodules

git clone --recurse-submodules https://github.com/veldhub/veld_chain__train_infer_wordembeddings_multiple_architectures__amc.git

how to reproduce

Several chain velds are used for training models in varying configurations, so there is no single overarching workflow. However, the pattern is roughly:

  • preprocessing
  • training
  • evaluation
  • analysis

Each chain veld yaml represents one persisted workflow at one point in time. See the following yaml files for more details.
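Each such yaml is a docker compose file carrying veld metadata. As a rough orientation, a hypothetical, much-simplified sketch of what one might look like (all service names, paths, and metadata fields here are illustrative assumptions, not taken from this repo):

```yaml
# Illustrative sketch only -- not a file from this repo.
x-veld:            # veld metadata block (field names assumed)
  chain:
    description: "example chain: lowercase a raw txt corpus"

services:
  veld_preprocess_lowercase:
    build: .
    volumes:
      - ./data/txt_raw/:/veld/input/
      - ./data/txt_lowercased/:/veld/output/
```

Running `docker compose -f <chain>.yaml up` then executes the workflow the file describes.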

preprocessing

The entire AMC data is preprocessed with combinations of these preprocessing chains:

./veld_preprocess_clean.yaml

Removes lines that fall below a threshold ratio of textual to non-textual content (numbers, special characters).

docker compose -f veld_preprocess_clean.yaml up
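The cleaning step above can be sketched as follows. This is a hypothetical illustration of threshold-based filtering, assuming "textual content" means alphabetic characters; the actual veld may define the ratio differently.

```python
def text_ratio(line: str) -> float:
    """Fraction of non-whitespace characters that are alphabetic."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)

def clean_lines(lines, threshold=0.8):
    """Yield only lines whose textual ratio reaches the threshold."""
    for line in lines:
        if text_ratio(line) >= threshold:
            yield line
```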

./veld_preprocess_lowercase.yaml

Makes entire text lowercase.

docker compose -f veld_preprocess_lowercase.yaml up

./veld_preprocess_remove_punctuation.yaml

Removes punctuation from text with spaCy pretrained models.

docker compose -f veld_preprocess_remove_punctuation.yaml up
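The actual veld uses spaCy pretrained models, which tokenize the text before filtering; as a much-simplified stand-in for the same idea, punctuation characters can be dropped from the raw string:

```python
import string

def remove_punctuation(text: str) -> str:
    """Drop ASCII punctuation characters (simplified; no tokenization)."""
    punct = set(string.punctuation)
    return "".join(c for c in text if c not in punct)
```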

./veld_preprocess_sample.yaml

Takes a random sample of lines from a txt file. A seed can be set to make the sampling reproducible.

docker compose -f veld_preprocess_sample.yaml up
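Seeded sampling as described above can be sketched like this (a hypothetical illustration; function name and signature are not from the repo):

```python
import random

def sample_lines(lines, n, seed=None):
    """Draw n random lines; a fixed seed makes the sample reproducible."""
    rng = random.Random(seed)
    return rng.sample(list(lines), n)
```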

./veld_preprocess_strip.yaml

Removes all lines before and after given line numbers.

docker compose -f veld_preprocess_strip.yaml up
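The stripping step can be sketched as follows, assuming 1-based, inclusive line numbers (an illustration, not the repo's implementation):

```python
def strip_lines(lines, start, end):
    """Keep only lines between start and end (1-based, inclusive)."""
    return [line for i, line in enumerate(lines, start=1) if start <= i <= end]
```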

training

The following chains encapsulate training workflows:

./veld_train_fasttext.yaml

A fastText training setup.

docker compose -f veld_train_fasttext.yaml up

./veld_train_glove.yaml

A GloVe training setup.

docker compose -f veld_train_glove.yaml up

./veld_train_word2vec.yaml

A word2vec training setup.

docker compose -f veld_train_word2vec.yaml up

evaluation

The following chains encapsulate evaluation workflows:

./veld_eval_fasttext.yaml

Custom evaluation logic on fastText word embeddings.

docker compose -f veld_eval_fasttext.yaml up

./veld_eval_glove.yaml

Custom evaluation logic on GloVe word embeddings.

docker compose -f veld_eval_glove.yaml up

./veld_eval_word2vec.yaml

Custom evaluation logic on word2vec word embeddings.

docker compose -f veld_eval_word2vec.yaml up
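The evaluation logic in these chains is custom and not specified here. As an illustration of the basic building block most static-embedding evaluations rest on, a pure-Python cosine similarity between two embedding vectors (vectors here are made up):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```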

analysis

These chains condense the training and evaluation results into comprehensive overviews and statistics:

./veld_analyse_evaluation.yaml

Launches a Jupyter notebook with various analysis and visualization steps.

docker compose -f veld_analyse_evaluation.yaml up

./veld_analyse_evaluation_non_interactive.yaml

Executes the Jupyter notebook code non-interactively, mainly for persisting statistics and visualizations as versioned files.

docker compose -f veld_analyse_evaluation_non_interactive.yaml up
