This repository contains the code for Team Innovators' SemEval-2022 Task 8 submission, described in the paper "Multi-Task Training with Hyperpartisan and Semantic Relation for Multi-Lingual News Article Similarity". The shared task emphasizes finding the similarity of multilingual news articles irrespective of the style of writing, political spin, tone, or any other more subjective "design decision" imposed by a medium/outlet. We propose a pipeline consisting of TextRank to filter out irrelevant information, followed by a multi-task approach that lets multiple sub-tasks share the same encoder during training, thereby facilitating knowledge transfer.
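For illustration, here is a minimal PyTorch sketch of the shared-encoder idea (the encoder name, task set, and head sizes are assumptions for the example, not the exact configuration used in the paper):

```python
import torch.nn as nn
from transformers import AutoModel

class MultiTaskModel(nn.Module):
    """Shared transformer encoder with one lightweight head per sub-task.

    Every sub-task's loss updates the shared encoder weights, which is
    what enables the knowledge transfer described above.
    """

    def __init__(self, encoder_name="xlm-roberta-base", task_num_labels=None):
        super().__init__()
        # All sub-tasks share this single encoder.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Hypothetical task list: one regression/classification head each.
        task_num_labels = task_num_labels or {"similarity": 1, "hyperpartisan": 2}
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in task_num_labels.items()}
        )

    def forward(self, task, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Pool with the first (<s>/[CLS]) token representation.
        pooled = out.last_hidden_state[:, 0]
        return self.heads[task](pooled)
```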
The model is trained on multiple subtasks as outlined below. The results are evaluated on the SemEval dataset found here.
The SemEval dataset consists of a CSV file in which each row corresponds to a pair of articles. For each article, the `url_lang`, `link`, and `id` are given, along with similarity scores across the `Geography`, `Entities`, `Time`, `Narrative`, `Style`, `Tone`, and `Overall` dimensions. The final evaluation is done on the `Overall` similarity. The content of the news articles is extracted using the script given here.
Subtask | Description | Dataset |
---|---|---|
Semantic Textual Similarity | Determine how semantically similar two pieces of text are. | STS benchmark |
Hyperpartisan detection | Given a news article, decide whether it follows a hyperpartisan argumentation, i.e., whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person. | Hyperpartisan News Detection |
Stance detection | Estimate the relative perspective (or stance) of two pieces of text with respect to a topic, claim, or issue. | Fake News Challenge - 1 |
Fake news inference detection | Detect fake news using natural language inference. This entails categorizing a piece of text as "pants-on-fire", "false", "barely true", "half-true", "mostly true", or "true". | Fake News Inference Dataset |
Paraphrase detection | Determine whether a particular sentence is a paraphrase of the original text. | Microsoft Research Paraphrase Corpus |
Preprocessed versions of the above datasets are available in the `dataset` folder, and some are used directly through the Hugging Face GLUE dataset, so there is no need to download them separately.
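For example, the STS and paraphrase data can be pulled straight from the Hugging Face Hub via the `datasets` library:

```python
from datasets import load_dataset

# STS-B and MRPC ship with the GLUE benchmark on the Hugging Face Hub,
# so they can be loaded directly instead of being stored locally.
stsb = load_dataset("glue", "stsb")   # Semantic Textual Similarity Benchmark
mrpc = load_dataset("glue", "mrpc")   # Microsoft Research Paraphrase Corpus

print(stsb["train"][0])  # {'sentence1': ..., 'sentence2': ..., 'label': ..., 'idx': ...}
```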
The models can be run locally by cloning this repository, or on Google Colab using the following links. The Pearson scores reported are on the validation dataset during training.
Model | Pearson Score | Link |
---|---|---|
Main Model: Multi-task Training | 0.835 | |
Experiment 1: Multi-Objective Weighted Loss Training | 0.811 | |
Experiment 2: Multi-task Training with Multilingual Text Rank | 0.737 | |
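The metric behind this table is the Pearson correlation between predicted and gold `Overall` similarity scores; a minimal way to compute it with SciPy:

```python
from scipy.stats import pearsonr

def pearson_score(gold, pred):
    """Pearson correlation between gold labels and model predictions."""
    r, _ = pearsonr(gold, pred)
    return r

# Toy example with made-up values: near-perfectly correlated predictions.
print(pearson_score([1.0, 2.5, 4.0], [1.2, 2.4, 3.9]))
```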
- Clone the current repository and upload it to Google Drive
- Open the relevant notebook in the training module folder and enable GPU access
- Connect the notebook to your Google Drive (see the snippet after this list). You can see the tutorial here
- Install the dependencies mentioned in the initial cells, then run the rest of the cells
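Connecting to Drive amounts to the standard Colab mount call; the repository path below is hypothetical, so adjust it to wherever you uploaded the repo:

```python
from google.colab import drive

# Mount your Drive so the notebook can read the cloned repository.
drive.mount('/content/drive')

# Hypothetical location of the uploaded repository.
%cd /content/drive/MyDrive/semeval-task8
```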
- Nidhir Bhavsar* (Navrachana University, Gujarat, India): [email protected]
- Rishikesh Devanathan* (Indian Institute of Technology Patna, India): [email protected]
- Aakash Bhatnagar* (Navrachana University, Gujarat, India): [email protected]
- Tirthankar Ghosal (UFAL, MFF, Charles University, Czech Republic): [email protected]
- Muskaan Singh (IDIAP Research Institute, Switzerland)
\* denotes equal contribution