This repository contains the simulation code underlying the paper: Loukina, A., Madnani, N., Cahill, A., Johnson, M. S., Riordan, B., and McCaffrey, D. F. (2020). *Using PRMSE to evaluate automated scoring systems in the presence of label noise*. In Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, Seattle, pp. 18–29.
- Create a conda environment called `prmse` by running `conda env create -f environment.yml`. This command might take a few minutes; please wait for it to finish.
- Activate this new environment by running `conda activate prmse`.
The simulated dataset used in the paper is stored as a combination of `.csv` files under `data`:

- `scores.csv` - simulated human, machine, and true scores
- `rater_metadata.csv` - information about each simulated "human" rater
- `system_metadata.csv` - information about each simulated "system"

These `.csv` files are provided for reference and will not be overwritten by the notebooks.

The same dataset is also stored as the `default.dataset` file: a serialized instance of the `Dataset` class used in all simulations (see the notes section below). This file will be overwritten if you make changes to the notebooks or to the settings.
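For a quick look at the simulated data without opening any notebooks, the three CSV files can be loaded directly with pandas. This is a minimal sketch that assumes only the file names listed above; see the `to_frames()` docstring in the `Dataset` class for the authoritative description of each file's columns.

```python
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data")

# Load the three reference CSV files that make up the simulated dataset.
df_scores = pd.read_csv(DATA_DIR / "scores.csv")            # simulated human, machine, and true scores
df_raters = pd.read_csv(DATA_DIR / "rater_metadata.csv")    # one row per simulated "human" rater
df_systems = pd.read_csv(DATA_DIR / "system_metadata.csv")  # one row per simulated "system"

# Quick sanity checks on the loaded frames.
print(df_scores.shape, df_raters.shape, df_systems.shape)
print(df_scores.head())
```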
The code for the simulations is divided into a set of Jupyter notebooks under the `notebooks` directory.
- `making_a_dataset.ipynb`. This is the notebook used to create a simulated dataset using the dataset parameters stored in `notebooks/dataset.json`. In addition to creating the dataset, it also contains some preliminary analyses on the dataset to make sure that it behaves as expected. This notebook serializes the dataset and saves it under `data/default.dataset`. This serialized dataset file is then used by the subsequent notebooks to load the dataset. Therefore, changing the parameters in `dataset.json` and re-running this notebook will change the results of the analyses in the other notebooks.

- `multiple_raters_true_score.ipynb`. In this notebook, we explore the impact of using a larger number of human raters in the evaluation process. More specifically, we show that as we use more and more human raters, the average of the scores assigned by those raters approaches the true score. In addition, we show that when a given automated system is evaluated against an increasing number of human raters, the values of the conventional agreement metrics approach the values that would be computed if that same system were evaluated against the true score.

- `metric_stability.ipynb`. In this notebook, we compare the stability of conventional agreement metrics such as Pearson's correlation, quadratically-weighted kappa, mean squared error, and R^2 to that of the proposed PRMSE metric. We do this by showing that the conventional agreement metrics can give very different results depending on the pair of human raters used as the reference against which the system score is evaluated, whereas the PRMSE metric yields stable evaluation results across different pairs of human raters.

- `ranking_multiple_systems.ipynb`. In this notebook, we explore how to rank multiple automated scoring systems. Specifically, we consider the situation where we have scores from several different automated scoring systems, each with a different level of performance. We evaluate these systems against the same as well as different pairs of raters and show that while all metrics can rank the systems accurately when a single rater pair is used for evaluation, only PRMSE can do the same when a different rater pair is used for each system.

- `prmse_and_double_scoring.ipynb`. In this notebook, we explore the impact of the number of double-scored responses on PRMSE. Computing PRMSE requires that at least some of the responses have scores from two human raters, but it may not be practical to have every single response double-scored. In this notebook, we examine how PRMSE depends on the number of double-scored responses available in the dataset. (A simplified sketch of the PRMSE computation itself is shown right after this list.)
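To make the metric being studied concrete, here is a minimal, illustrative sketch of how PRMSE can be estimated in the simplified case where every response is scored by exactly two human raters. This is not the code used by the notebooks (they rely on the `simulation` package described in the notes below); the function name, the fully-double-scored assumption, and the toy data are ours.

```python
import numpy as np


def prmse_double_scored(system, h1, h2):
    """Illustrative PRMSE when every response has scores from two human raters.

    PRMSE is the proportional reduction in mean squared error for predicting
    the (unobserved) true score: 1 - MSE(system, true) / Var(true), with both
    quantities estimated from the two human scores.
    """
    system, h1, h2 = map(np.asarray, (system, h1, h2))
    h_mean = (h1 + h2) / 2

    # Variance of individual rater errors, estimated from rater disagreement:
    # E[(h1 - h2)^2] = 2 * Var(rater error).
    var_errors = np.mean((h1 - h2) ** 2) / 2

    # True-score variance: variance of the average human score minus the error
    # variance of that average (var_errors / 2 when averaging two raters).
    var_true = np.var(h_mean, ddof=0) - var_errors / 2

    # MSE of the system against the true score: MSE against the average human
    # score minus the part attributable to the raters' own errors.
    mse_true = np.mean((system - h_mean) ** 2) - var_errors / 2

    return 1 - mse_true / var_true


# Toy usage on made-up scores (purely illustrative numbers).
rng = np.random.default_rng(0)
true_scores = rng.normal(3.5, 0.8, size=1000)
h1 = true_scores + rng.normal(0, 0.5, size=1000)
h2 = true_scores + rng.normal(0, 0.5, size=1000)
system = true_scores + rng.normal(0, 0.4, size=1000)
print(round(prmse_double_scored(system, h1, h2), 3))
```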
If you are interested in running your own PRMSE simulations, you need to:
- Edit the `dataset.json` file to change any of the following dataset-specific settings (an illustrative sketch of the configuration files follows this list):
    - the number of responses in the dataset (`num_responses`)
    - the distribution underlying the true scores and the total number of score points (`true_score_mean`, `true_score_sd`, `min_score`, and `max_score`)
    - new categories of simulated human raters and automated systems (`rater_categories`, `rater_rho_per_category`, `system_categories`, and `system_r2_per_category`)
    - the number of simulated raters and/or systems per category (`num_raters_per_category` and `num_systems_per_category`)
- Run the `making_a_dataset.ipynb` notebook to create and save your new dataset instance as `data/default.dataset`.
- Edit the `settings.json` file to change any of the following notebook-specific simulation settings:
    - `double_scored_percentages`: the percentages of double-scored responses that are simulated in `prmse_and_double_scoring.ipynb`.
    - `key_steps_n_raters`: the number of raters included in the cumulative calculations in `multiple_raters_true_score.ipynb`.
    - `rater_pairs_per_category`: the pre-determined number of rater pairs per category used in `metric_stability.ipynb`, `ranking_multiple_systems.ipynb`, and `prmse_and_double_scoring.ipynb`.
    - `sample_system`: the simulated automated scoring system chosen as the source of automated scores in `multiple_raters_true_score.ipynb` and `metric_stability.ipynb`.
- Run the notebooks to see how PRMSE performs for your simulation settings.
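As a concrete illustration of the configuration step, the sketch below generates a hypothetical pair of configuration files using the setting names listed above. All of the values, the exact JSON structure (e.g., whether per-category settings are objects keyed by category name), and the location of `settings.json` are assumptions made for illustration only; use the `dataset.json` and `settings.json` files shipped in the repository as the authoritative templates, and note that running this snippet would overwrite them.

```python
import json

# Hypothetical dataset-specific settings (values and structure are assumptions;
# start from the repository's own notebooks/dataset.json instead).
dataset_settings = {
    "num_responses": 10000,
    "true_score_mean": 3.5,
    "true_score_sd": 0.8,
    "min_score": 1,
    "max_score": 6,
    "rater_categories": ["high", "medium", "low"],
    "rater_rho_per_category": {"high": 0.9, "medium": 0.8, "low": 0.7},
    "num_raters_per_category": 10,
    "system_categories": ["strong", "weak"],
    "system_r2_per_category": {"strong": 0.8, "weak": 0.6},
    "num_systems_per_category": 5,
}

# Hypothetical notebook-specific simulation settings (again, assumptions only).
notebook_settings = {
    "double_scored_percentages": [10, 25, 50, 75, 100],
    "key_steps_n_raters": [2, 5, 10, 20],
    "rater_pairs_per_category": 5,
    "sample_system": "strong_1",
}

# Write the two files; settings.json is assumed to live alongside dataset.json.
with open("notebooks/dataset.json", "w") as dataset_file:
    json.dump(dataset_settings, dataset_file, indent=2)

with open("notebooks/settings.json", "w") as settings_file:
    json.dump(notebook_settings, settings_file, indent=2)
```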
- Note that the structure and order of the notebooks do not necessarily follow the order of the analyses in the paper. For example, in the paper we first show the gaps in the traditional metrics and then demonstrate that PRMSE can help address them. In the notebooks, however, it is more efficient to keep the analyses with and without PRMSE in the same notebook as long as they use the same data.
- For efficiency and readability reasons, a lot of the code shared by the notebooks is factored out into a package called `simulation` found under `notebooks/simulation`. This package contains two main Python files:
    - `simulation/dataset.py`. This module contains the main `Dataset` class representing the simulated dataset underlying all of the PRMSE simulations.
    - `simulation/utils.py`. This module contains several utility functions needed for the various simulations in the notebooks.
- Running the `making_a_dataset.ipynb` notebook also saves three CSV files, one for each of the data frames that can be obtained by calling the `to_frames()` method on the dataset instance saved in `data/default.dataset`. Between themselves, these three CSV files contain all of the simulated scores as well as the rater and system metadata. For a detailed description of each data frame, see the docstring for the `to_frames()` method of the `Dataset` class. We make these CSV files available under `data` so that they can be examined and modified in other programs such as Excel and R. However, making changes to these CSV files will not affect the analyses in the other notebooks since they use the `data/default.dataset` file and not these CSV files. (A sketch of loading the serialized dataset directly is shown below.)
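If you prefer to work with the serialized dataset directly rather than with the CSV files, a minimal sketch is shown below. It assumes that `data/default.dataset` is a standard pickle of the `Dataset` instance and that `to_frames()` returns the scores, rater metadata, and system metadata frames in that order; both assumptions should be verified against `simulation/dataset.py` and the `to_frames()` docstring.

```python
import pickle
import sys

# The notebooks import the `simulation` package from the `notebooks` directory,
# so make it importable here as well before unpickling the Dataset instance.
sys.path.append("notebooks")

# Assumption: the serialized dataset is a plain pickle of a Dataset instance.
with open("data/default.dataset", "rb") as dataset_file:
    dataset = pickle.load(dataset_file)

# Assumption: to_frames() returns (scores, rater metadata, system metadata);
# see its docstring for the actual return value and column descriptions.
df_scores, df_rater_metadata, df_system_metadata = dataset.to_frames()
print(df_scores.head())
```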
The code and data in this repository are released under the MIT license.