Create a virtual environment.
conda create -n eppi_text python=3.11
conda activate eppi_text
Install.
pip3 install -e .
python3 -m spacy download en_core_web_sm
If you wish to run tests, you will need to install the test dependencies.
pip3 install -e ".[test]"
Type: String
Description: The path to the labelled data tsv, relative to the working container url of the production blob storage, as set by the working_container_url input. This labelled data is used for training the model, finding the best model via n-fold cross-validation, and generating statistics on the performance of the trained model.
Implementation: MAY_REQUIRE_INTERNAL_LOGIC
Type: String
Description: The path to the unlabelled data tsv, relative to the working container url of the production blob storage, as set by the working_container_url input. This unlabelled data will be classified by the trained model.
Implementation: MAY_REQUIRE_INTERNAL_LOGIC
Type: String
Description: The header of the title column in the tsv file.
Implementation: SET_FROM_DATA_FACTORY
Type: String
Description: The header of the abstract column in the tsv file.
Implementation: SET_FROM_DATA_FACTORY
Type: String
Description: The header of the label column in the tsv file.
Implementation: MAY_REQUIRE_INTERNAL_LOGIC
Type: String
Description: The value in the column headed by label_header that the model should treat as the positive class.
Implementation: MAY_REQUIRE_INTERNAL_LOGIC
(Need to give some more details on why I think that the two above may require internal logic)
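As a rough illustration of how the title, abstract and label headers and the positive value fit together, the sketch below reads a small tsv and converts the label column to 0/1. The helper load_labelled_rows, the argument name positive_value, the column names and the sample rows are all hypothetical, and joining the title and abstract into one string is an assumption made for the example rather than a description of the pipeline.

```python
import csv
import io


def load_labelled_rows(tsv_text, title_header, abstract_header,
                       label_header, positive_value):
    """Return (text, label) pairs, where label is 1 for the positive class."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for record in reader:
        # Assumption for this sketch: title and abstract are joined with a space.
        text = f"{record[title_header]} {record[abstract_header]}"
        # Any value other than positive_value is treated as negative.
        label = 1 if record[label_header] == positive_value else 0
        rows.append((text, label))
    return rows


# Hypothetical sample data; the real files live in blob storage.
SAMPLE_TSV = (
    "title\tabstract\tincluded\n"
    "Study A\tA randomised trial.\tyes\n"
    "Study B\tAn editorial.\tno\n"
)

rows = load_labelled_rows(SAMPLE_TSV, "title", "abstract", "included", "yes")
```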
Type: String
Description: The name of the model that the user would like to train for classification.
Choices: lightgbm | RandomForestClassifier | xgboost | SVC
Implementation: EXPOSE_IMMEDIATELY
Notes: A different type of xgboost can be selected by carefully choosing the hyperparameter ranges. It performs very well, so I will probably expose it under its own name, "xgboostLinear", when I get around to it.
Type: String
Description: The path to a JSON file containing the hyperparameter search ranges used for selecting the model. The file follows a particular format that I will document later.
Implementation: ADVANCED
Type: Integer
Description: The number of iterations of hyperparameter search the user would like to run. The best model is selected out of all iterations.
Implementation: ADVANCED
Type: Integer
Description: The number of folds to use in cross-validation.
Implementation: ADVANCED
Type: Integer
Description: When n-fold cross-validation is done with a small number of samples, the result can vary massively depending on how the folds are selected. This instability can lead to suboptimal models being selected. To combat this, we can repeat cross-validation with many different seeds to reduce the statistical variation.
Implementation: ADVANCED
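The idea of repeating cross-validation with different seeds can be sketched in plain Python as below. This is only an illustration under stated assumptions: the helpers kfold_indices and repeated_cv_score and the score_fold callback are hypothetical names, the splits are not stratified, and the pipeline itself may well use a library routine instead.

```python
import random
import statistics


def kfold_indices(n_samples, n_folds, seed):
    """Shuffle the sample indices with `seed` and deal them into folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def repeated_cv_score(n_samples, n_folds, n_repeats, score_fold):
    """Average per-fold scores over several differently seeded splits."""
    scores = []
    for seed in range(n_repeats):
        for val_idx in kfold_indices(n_samples, n_folds, seed):
            held_out = set(val_idx)
            train_idx = [i for i in range(n_samples) if i not in held_out]
            # score_fold is expected to train on train_idx and score on val_idx.
            scores.append(score_fold(train_idx, val_idx))
    return statistics.mean(scores), statistics.stdev(scores)


# Toy stand-in for "train and evaluate": the share of positives in each fold.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def toy_score(train_idx, val_idx):
    return sum(labels[i] for i in val_idx) / len(val_idx)


mean_score, score_spread = repeated_cv_score(len(labels), 3, 5, toy_score)
```

Each repeat uses every sample exactly once as validation data, so adding repeats changes only how the samples are grouped, which is what averages out the split-dependent variation.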
Type: Integer
Description: The time in seconds after which the search will be terminated if it is still running.
Implementation: ADVANCED
Type: Boolean
Description: When enabled, a Gaussian mixture model measures the statistical variation of the results and estimates the likelihood that a better model exists. When that likelihood is sufficiently low, the search is terminated.
Implementation: ADVANCED
Type: Integer
Description: If the search has not found a new best model within this many iterations of the last best model, the search terminates.
Implementation: ADVANCED
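The stopping rule described above can be sketched as a simple counter that resets whenever a new best score appears. The function search_with_patience, its patience argument and the scores are hypothetical and only illustrate the behaviour; the sketch assumes that higher scores are better.

```python
def search_with_patience(trial_scores, patience):
    """Stop once `patience` consecutive trials fail to beat the best score.

    Returns the best score and the number of trials that were evaluated.
    """
    best_score = float("-inf")
    trials_since_best = 0
    n_evaluated = 0
    for score in trial_scores:
        n_evaluated += 1
        if score > best_score:
            # A new best model resets the counter.
            best_score = score
            trials_since_best = 0
        else:
            trials_since_best += 1
            if trials_since_best >= patience:
                break
    return best_score, n_evaluated


# Hypothetical trial scores: the last trial is never reached.
best, n_evaluated = search_with_patience(
    [0.61, 0.70, 0.68, 0.69, 0.66, 0.90], patience=3
)
```

A small value shortens the search but risks stopping just before a better trial, as the unreached final score in the example shows.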
Type: Number
Description: When set, a Wilcoxon test is performed on the results of the current hyperparameter search iteration against the current best. If the first few results of the new iteration are significantly worse, it is pruned.
Implementation: ADVANCED
Type: Boolean
Description: When set to true, if the first two results of a cross-validation iteration of the hyperparameter search are worse than the best result, the trial is pruned. This results in at least a 33% reduction in search time.
Implementation: ADVANCED
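A minimal sketch of this kind of pruning is shown below. It assumes that "worse than the best result" means both of the first two fold scores fall below the mean score of the best trial so far; the actual comparison used by the pipeline may differ. The function evaluate_trial and the scores are hypothetical.

```python
def evaluate_trial(fold_scores, best_mean_score):
    """Return the trial's mean score, or None if the trial is pruned.

    `fold_scores` is any iterable yielding one score per cross-validation fold,
    so a generator lets the remaining folds go unevaluated after pruning.
    """
    seen = []
    for score in fold_scores:
        seen.append(score)
        # Assumption: prune when both of the first two folds score below
        # the best trial's mean score.
        if len(seen) == 2 and all(s < best_mean_score for s in seen):
            return None
    return sum(seen) / len(seen)


# Hypothetical fold scores for two trials, compared against a best mean of 0.7.
pruned = evaluate_trial(iter([0.50, 0.55, 0.90]), best_mean_score=0.7)
kept = evaluate_trial(iter([0.80, 0.60, 0.75]), best_mean_score=0.7)
```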
Type: Integer
Description: The top shap_num_display features will be shown in the shap model explainability plots.
Implementation: ADVANCED
Type: String
Description: The url of the working container in the production blob storage. All paths, such as labelled_data_path, unlabelled_data_path and output_container_path, are relative to this. The pipeline will not have access to anything in the blob storage that is not nested within this container.
Implementation: MAY_REQUIRE_INTERNAL_LOGIC
Type: String
Description: The path where the data and results of the pipeline are saved. This should be relative to working_container_url.
Implementation: MAY_REQUIRE_INTERNAL_LOGIC
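To make the "relative to the working container" rule concrete, the sketch below joins a relative path onto a container url and rejects any path that would resolve outside the container. It is not the pipeline's implementation: the helper resolve_in_container, the example url and the file names are hypothetical, and the sketch assumes the container name is the first path segment of the url.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def resolve_in_container(working_container_url, relative_path):
    """Join a relative path onto the container url, rejecting escapes."""
    parts = urlsplit(working_container_url)
    container_root = parts.path.rstrip("/")
    joined = posixpath.normpath(
        posixpath.join(container_root, relative_path.lstrip("/"))
    )
    # After normalisation, the result must still sit under the container root.
    if joined != container_root and not joined.startswith(container_root + "/"):
        raise ValueError(f"{relative_path!r} is outside the working container")
    # Any query string (for example an access token) is preserved unchanged.
    return urlunsplit(parts._replace(path=joined))


# Hypothetical container url and relative path.
CONTAINER_URL = "https://example.blob.core.windows.net/my-container"
labelled_url = resolve_in_container(CONTAINER_URL, "data/labelled.tsv")
```

A path such as "../other-container/file.tsv" raises ValueError here, mirroring the restriction that nothing outside the working container is reachable.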
Type: String
Description: The id of the managed identity. This managed identity provides the cluster with access to the production blob storage.
Implementation: SET_FROM_DATA_FACTORY
Description: A directory containing all the plots from the find_single_model pipeline.
Description: A database containing the information from the hyperparameter search. This doesn't really need to be saved at the moment, but it may be useful in the future if we want to offer a continue-search option for cases where the model isn't good enough.
Description: A file containing the tfidf scores of the labelled data.
Description: A file containing the tfidf scores of the unlabelled data.
Description: A file containing the names of the features that are used as columns in the tfidf array. They are ordered in the same order as the tfidf arrays in the .npz files.
Description: A file containing an array of the labels of the labelled data in order of the rows of the labelled_tfidf.npz.
Description: A file containing the hparams of the best performing model from the hyperparameter search.
Description: A directory which contains the trained model. The file type varies based on which model is used.