https://transformer-models.s3.amazonaws.com/2019n2c2_tack1_roberta_pt_stsc_6b_16b_3c_8c.zip
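For example, the model can be downloaded and unpacked as follows (the destination directory is arbitrary):
wget https://transformer-models.s3.amazonaws.com/2019n2c2_tack1_roberta_pt_stsc_6b_16b_3c_8c.zip
unzip 2019n2c2_tack1_roberta_pt_stsc_6b_16b_3c_8c.zip -d pretrained_model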
- Python 3.7.3
- PyTorch 1.1.0
- Transformers 2.5.1
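The pinned versions can be installed with pip, for example (assuming a Python 3.7 environment is already active):
pip install torch==1.1.0 transformers==2.5.1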
General corpus: the Semantic Textual Similarity Benchmark (STS-B) dataset, available from the GLUE Benchmark download page
Clinical corpus: the clinical STS dataset, available from the 2019 N2C2 Challenge Track 1 website
- Preprocess the clinical dataset
python preprocess/prepro.py \
--data_dir=path/to/clinical_sts_dataset \
--output_dir=dir/to/output/clinical_sts_dataset
- Generate datasets for five-fold cross-validation (a conceptual sketch of the splitting follows below)
python preprocess/cross_valid_generate.py \
--data_dir=path/to/processed_dataset \
--output_dir=dir/to/output
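For reference, the fold generation is conceptually similar to the sketch below (hypothetical code, assuming scikit-learn is available; the actual preprocess/cross_valid_generate.py may read and write files differently):

# Hypothetical sketch of five-fold splitting; not the repo's exact code
from sklearn.model_selection import KFold

def make_folds(examples, n_splits=5, seed=42):
    # Yield one (train, dev) pair of example lists per fold
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, dev_idx in kf.split(examples):
        yield ([examples[i] for i in train_idx],
               [examples[i] for i in dev_idx])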
Training and prediction are provided by the following scripts (example invocations below):
- single.sh: uses a single model
- ensemble.sh: uses a multi-model ensemble
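For example (check each script for configurable paths and hyperparameters before running):
bash single.sh     # training and prediction with a single model
bash ensemble.sh   # training and prediction with a multi-model ensemble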
Use the script cv_eval.sh to select the best hyperparameters (batch size and number of epochs) based on the five-fold cross-validation results.
--input_dir: path to the directory containing the five-fold cross-validation results
--output_dir: path to the directory where the evaluation result is written
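For example (directory paths are placeholders):
bash cv_eval.sh \
--input_dir=dir/of/cross_validation_results \
--output_dir=dir/to/output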
In principle, all models listed at https://huggingface.co/transformers/pretrained_models.html are supported; however, we only used BERT, RoBERTa, and XLNet in this task.
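As a minimal sketch (not the repo's exact code), swapping the backbone model only requires changing the model name; STS is treated as single-value regression:

# Minimal sketch; assumes Transformers 2.5.1 with an STS regression head
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-base"  # e.g. "bert-base-uncased" or "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 yields a single regression output for the similarity score
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)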
- Please cite our paper:
Yang X, He X, Zhang H, Ma Y, Bian J, Wu Y. Measurement of Semantic Textual Similarity in Clinical Texts: Comparison of Transformer-Based Models. JMIR Med Inform 2020;8(11):e19735. DOI: 10.2196/19735