From 5d76f6574dba2944970e7c7d8bdbbe474bb6a0e9 Mon Sep 17 00:00:00 2001
From: seanzhangkx8 <106214464+seanzhangkx8@users.noreply.github.com>
Date: Fri, 14 Jun 2024 15:13:46 -0400
Subject: [PATCH] CMV Corpus doc - fix metadata list, fix main page access

---
 README.md                |  2 +-
 docs/source/awry_cmv.rst | 10 +++++-----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index 1cb12d0b..14fdc334 100644
--- a/README.md
+++ b/README.md
@@ -57,7 +57,7 @@ Available as an interactive notebook: [full version (fine-tuning + inference)](h
 ConvoKit ships with several datasets ready for use "out-of-the-box". These datasets can be downloaded using the `convokit.download()` [helper function](https://github.com/CornellNLP/ConvoKit/blob/master/convokit/util.py). Alternatively you can access them directly [here](http://zissou.infosci.cornell.edu/convokit/datasets/).
-### [Conversations Gone Awry Datasets]([Wikipedia](https://convokit.cornell.edu/documentation/awry.html)/[CMV](https://convokit.cornell.edu/documentation/awry_cmv.html))
+### Conversations Gone Awry Datasets ([Wikipedia](https://convokit.cornell.edu/documentation/awry.html)/[CMV](https://convokit.cornell.edu/documentation/awry_cmv.html))
 Two related corpora of conversations that derail into antisocial behavior. One corpus (CGA-WIKI) consists of Wikipedia talk page conversations that derail into personal attacks as labeled by crowdworkers (4,188 conversations containing 30.021 comments). The other (CGA-CMV) consists of discussion threads on the subreddit ChangeMyView (CMV) that derail into rule-violating behavior as determined by the presence of a moderator intervention (6,842 conversations containing 42,964 comments).
 Name for download: `conversations-gone-awry-corpus` (for CGA-WIKI) or `conversations-gone-awry-cmv-corpus` (for CGA-CMV)
diff --git a/docs/source/awry_cmv.rst b/docs/source/awry_cmv.rst
index 765ac26f..7c1abf53 100644
--- a/docs/source/awry_cmv.rst
+++ b/docs/source/awry_cmv.rst
@@ -52,11 +52,11 @@ Metadata for each conversation include:
 * has_removed_comment: whether the final comment in this thread was removed by CMV moderators for violation of Rule 2
 * split: which split (train, val, or test) this conversation was used in for the experiments described in "Trouble on the Horizon"
 * summary_meta: metadata related to conversation summaries, a list of dictionaries (one per summary available, possibly empty) with the following keys:
- * * summary_text: the text of the summary;
- * * summary_type: whether the summary is humman written by humans;(human_written_SCD) or generated automatically using the procedural prompt ("machine_generated_SCD") ;
- * * up_to_utterance_id: the last utterance considered when creating the summary;
- * * truncated_by: the number of utterances the transcript was truncated by when creating the summary (starting from the end);
- * * scd_split: whether the summary was in the train/test/validation split in the 2024 Summarizing Conversations Dynamics paper;
+ * summary_text: the text of the summary;
+ * summary_type: whether the summary is humman written by humans;(human_written_SCD) or generated automatically using the procedural prompt ("machine_generated_SCD") ;
+ * up_to_utterance_id: the last utterance considered when creating the summary;
+ * truncated_by: the number of utterances the transcript was truncated by when creating the summary (starting from the end);
+ * scd_split: whether the summary was in the train/test/validation split in the 2024 Summarizing Conversations Dynamics paper;
 Usage