This is a repository of code used in the paper Assisting Human Decisions in Document Matching.
The data points used for the task are sampled from CNN/DailyMail. Run
python collect_data.py
to collect (summary, candidate articles) pairs, where the articles are picked according to their similarity to the summary. The collected information will be saved in the data/ directory.
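For illustration only, the sketch below ranks candidate articles by TF-IDF cosine similarity to a query summary. This is not the exact logic of collect_data.py, which may use a different similarity measure; the function and names here are hypothetical.

```python
# Illustrative only: rank candidate articles by TF-IDF cosine similarity to a summary.
# collect_data.py may use a different similarity/affinity measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_candidates(summary, articles, k=3):
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on the summary plus all articles so they share one vocabulary.
    matrix = vectorizer.fit_transform([summary] + articles)
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    ranked = scores.argsort()[::-1][:k]
    return [(int(i), float(scores[i])) for i in ranked]
```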
We manually modified a subset of the collected summaries and candidate articles above so that a single ground-truth answer is guaranteed for each question (refer to Section 3 of the paper for more details on how we modified the existing data points to better fit the task requirements).
We include two datasets consisting of multiple questions of the easy (data/easy-ans-samples-cnndm.pkl) and hard (data/hard-ans-samples-cnndm.pkl) types under the data/ folder. Questions sampled from these datasets were used for the user study.
Each dataset has the following format.
{
question_id (int) : {
'original_text' : ground-truth article (str),
'correct_summary' : query summary (str),
'wrong_text' : wrong candidate articles (list of str),
'score_t1_s1' : affinity score for the ground-truth article (numpy array),
'score_t2_s1' : affinity scores for the wrong candidate articles (numpy array),
'correct' : whether the ground-truth article has the highest affinity score (bool),
'ptype' : question type, either 'easy' or 'hard' (str)
}
}
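For example, a dataset can be loaded and inspected with pickle as follows (a minimal sketch; the keys follow the format above):

```python
import pickle

# Load the easy-type dataset (the hard-type dataset has the same format).
with open("data/easy-ans-samples-cnndm.pkl", "rb") as f:
    samples = pickle.load(f)

qid, sample = next(iter(samples.items()))
print(qid, sample["ptype"], sample["correct"])
print("ground-truth article score:", sample["score_t1_s1"])
print("wrong-candidate scores:", sample["score_t2_s1"])
print("number of wrong candidates:", len(sample["wrong_text"]))
```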
We additionally include metadata about each question in data/qfile.pkl, with the following format:
{
question_id (int) : {
'type': question type, either 'easy-ans' or 'hard-ans' (str),
'ans': correct answer, one of 1, 2, or 3,
'score-correct': whether the ground-truth article has the highest affinity score (bool)
}
}
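A short sketch of reading this metadata, e.g., to tally question types:

```python
import pickle
from collections import Counter

with open("data/qfile.pkl", "rb") as f:
    qmeta = pickle.load(f)

# Count question types and how often the highest affinity score already
# points to the ground-truth article.
print(Counter(meta["type"] for meta in qmeta.values()))
print(Counter(meta["score-correct"] for meta in qmeta.values()))
```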
You can see the questions as presented to the users in the zip file data/questions.zip, with the corresponding question ids included in the HTML filenames.
Running each method produces metadata used for the visualization of highlights, which will be saved in the output/ directory. You can download the outputs used for the experiments here, under the output/ folder.
We used the implementation of SHAP here. For the other methods (input gradient, integrated gradients), refer to scripts/compute_explanations.py.
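For reference, a typical pattern for token-level SHAP attributions on a text model looks like the sketch below. The classifier is a hypothetical stand-in; the actual model wrapping used for the experiments is in scripts/compute_explanations.py.

```python
import shap
from transformers import pipeline

# Hypothetical stand-in for the affinity model used in the experiments.
# top_k=None makes the pipeline return scores for all labels, which SHAP expects.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None,
)

explainer = shap.Explainer(classifier)  # SHAP infers a Text masker from the pipeline
shap_values = explainer(["An example article sentence to attribute."])

print(shap_values[0].data)    # tokens
print(shap_values[0].values)  # per-token attribution scores
```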
We used the implementation of BERTSum here to extract summaries for the articles. scripts/add_bertsum.py converts the raw BERTSum output into the metadata format we use for visualization. Refer to the scripts in the bertsum_scripts/ directory for how BERTSum was run.
The co-occurrence method computes the similarity between sentences based on the F1 score of the ROUGE-L metric. Refer to scripts/add_cooccur.py.
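A minimal sketch of this similarity computation, assuming the rouge-score package (the script may rely on a different ROUGE implementation):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_f1(summary_sentence, article_sentence):
    # Pairwise sentence similarity = F1 of the ROUGE-L score.
    return scorer.score(summary_sentence, article_sentence)["rougeL"].fmeasure

print(rouge_l_f1("The cat sat on the mat.", "A cat was sitting on the mat."))
```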
The semantic method computes the similarity between sentences based on the sentence representations learned by SentenceBERT. Refer to scripts/add_semantic.py.
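A minimal sketch using the sentence-transformers package; the checkpoint name below is an assumption, so see scripts/add_semantic.py for the model actually used:

```python
from sentence_transformers import SentenceTransformer, util

# Model name is illustrative; the script may load a different SentenceBERT checkpoint.
model = SentenceTransformer("all-MiniLM-L6-v2")

summary_sents = ["The president signed the bill on Tuesday."]
article_sents = ["On Tuesday the bill was signed into law.",
                 "Stocks fell sharply after the announcement."]

emb_summary = model.encode(summary_sents, convert_to_tensor=True)
emb_article = model.encode(article_sents, convert_to_tensor=True)

# Cosine similarity between every (summary sentence, article sentence) pair.
print(util.cos_sim(emb_summary, emb_article))
```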
Using the metadata generated by each of the methods above, scripts/generate_samples.py generates HTML files that visualize the summary-article pairs and highlights. These files were used to present the users with assistive information in the user study.
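As a rough illustration (not the actual template used by generate_samples.py), a highlight can be rendered by mapping a sentence's score to the opacity of an inline background color; all names below are hypothetical.

```python
import html

def highlight_sentence(sentence, score):
    # Map a similarity/attribution score in [0, 1] to a yellow highlight opacity.
    alpha = max(0.0, min(1.0, float(score)))
    return (f'<span style="background-color: rgba(255, 212, 0, {alpha:.2f})">'
            f"{html.escape(sentence)}</span>")

print(highlight_sentence("The bill was signed on Tuesday.", 0.8))
```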
Here, we share the responses from the users and statistics related to their task performance. All identifiable information has been completely removed, and each user is randomly assigned an integer user_number ranging from 1 to 271. We share three files:
user-response.pkl
: a dictionary of individual user responses for the questions. It has the following format:
{
user_number (int) : {
'type': experiment group the user is assigned to (str, one of {control, shap, bertsum, cooccur, semantic}),
'probs': list of question ids the user was given (list of int),
'correct': binary list indicating whether the user answered each question correctly (list of int),
'time' : list of time in seconds the user took to answer each question (list of float),
'ans' : list of answers given by the user (list of int, value one of {0,1,2,3}, representing the candidate article number selected; 0 means no article selected)
}
}
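For example, per-user accuracy and median answering time can be computed from this file as follows (a minimal sketch based on the format above; adjust the path as needed):

```python
import pickle
import statistics

with open("user-response.pkl", "rb") as f:
    responses = pickle.load(f)

for user_number, resp in sorted(responses.items()):
    accuracy = sum(resp["correct"]) / len(resp["correct"])
    median_time = statistics.median(resp["time"])
    print(f"user {user_number:3d} ({resp['type']:8s}): "
          f"accuracy={accuracy:.2f}, median time={median_time:.1f}s")
```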
user-stats.csv: a CSV file containing user statistics based on the individual responses to the questions.
user-qualitative.csv: a CSV file containing qualitative responses from each user.
notebooks/eval.ipynb: evaluates black-box model explanations using EM distances.
notebooks/power-analysis.ipynb: Monte Carlo power analysis for finding the effective sample size for the user study.
notebooks/analysis.ipynb: tests hypotheses, analyzes, and plots the results based on the user study responses.
Instructions for running each part of the code are described step by step in scripts/run.sh.