RuTransform is a Python framework for adversarial attacks and text data augmentation for Russian.
- Setup & Usage
- Examples
- Dataset Transformation on sample data
- Dataset Transformation on your own data
- Custom Constraints
- Sentence Transformations
- Custom Transformations
- Framework Design
- Cite us
Python >= 3.7
git clone https://github.com/RussianNLP/rutransform
cd rutransform
pip install .
Dataset Transformation (sample data)
For the full list of supported transformations with examples see supported transformations.
import pandas as pd
from rutransform.transformations import DatasetTransformer
from rutransform.utils.args import TransformArguments
# load data
dataset = pd.read_json('test_data/worldtree.json', lines=True)
# load arguments
transformation = "butter_fingers"
probability = 0.3
args = TransformArguments(transformation=transformation, probability=probability)
# init dataset transformer
tr = DatasetTransformer(
dataset=dataset,
text_col='question',
task_type='multichoice_qa',
args=args,
return_type='pd' # format of the resulting dataset (default is 'hf')
)
# run transformation
output = tr.transform()
DatasetTransformer
outputs a named tuple with dataset similarity scores:
score = output.score # mean dataset similarity score: 0.9146944761276246
scores = output.scores # similarity scores for each sentence: array([0.93971652, 0.94295949, 0.8272841 , 0.98828816, 0.87522411])
std = output.std # std of the similarity scores: 0.05663837594035781
...and the transformed dataset:
transformed_data = output.transformed_dataset
print('Original data:\n', dataset['question'][0])
print('Transformed data:\n', transformed_data['question'][0])
Original data:
Когда мороженое не кладут в морозильную камеру, мороженое превращается из ___. (A) твердого тела в газ (B) газа в жидкость (C) твердого тела в жидкость (D) жидкости в газ
Transformed data:
Когда мороженое не кладут в морозильную камеру, мороженое превращается из ___. (A) твердого тела в газ (B) газа в жидвость (C) мвердого тела в дидкость (D) жидкости в гкз
Dataset Transformation (own data)
RuTransform can easily be adapted to other tasks. To use the framework on your own data, simply specify the text (text_col
) and/or target (label_col
) column names and choose the suitable constraints (pass them into the custom_constraints
argument). For example, to run transformation on the DaNetQA data (Shavrina et al,. 2020) we choose to perturb the passage
text and use the NamedEntities
constraint to preserve proper nouns:
import pandas as pd
from rutransform.transformations import DatasetTransformer
from rutransform.utils.args import TransformArguments
from rutransform.constraints import NamedEntities
# load data
dataset = pd.read_json('test_data/danet_qa.json', lines=True)
# init arguments
transformation = "back_translation"
probability = 0.5
args = TransformArguments(transformation=transformation, probability=probability)
# init dataset transformer
tr = DatasetTransformer(
dataset=dataset,
text_col='passage',
args=args,
return_type='pd',
custom_constraints=[NamedEntities()],
device='cuda:0'
)
# run transformation
output = tr.transform()
print('Original data:\n', dataset['passage'][4],)
print('Transformed data:\n', output.transformed_dataset['passage'][4])
Original data:
Средство коммуникации. В своей простейшей форме искусство представляет собой средство коммуникации. Как и большинство прочих способов коммуникации, оно несет в себе намерение передать информацию аудитории. Например, научная иллюстрация — тоже форма искусства, существующая для передачи информации. Ещё один пример такого рода — географические карты. Однако содержание послания не обязательно бывает научным. Искусство позволяет передавать не только объективную информацию, но и эмоции, настроение, чувства.
Transformed data:
Средство коммуникации. В своей простой форме искусство является средством общения. Как и большинство других средств коммуникации, она намерена доводить информацию до сведения аудитории. Например, научная иллюстрация — тоже форма искусства, существующая для передачи информации. Ещё один пример такого рода — географические карты. Однако содержание послания не обязательно бывает научным. Искусство позволяет передавать не только объективную информацию, но и эмоции, настроение, чувства.
Custom Constraints
If the provided constraints are not enough, you can create your own ones by simple class inheritance. For example, to run transformation on the RWSD dataset (Shavrina et al,. 2020), we create an RWSDConstraint
:
from rutransform.constraints import Constraint
from rutransform.constraints.utils import parse_reference
from typing import List, Optional
from spacy.language import Language
class RWSDConstraint(Constraint):
def __init__(self, target_col_name: str, reference_key: str, noun_key: str) -> None:
super().__init__(name='rwsd_constraint')
self.target_col_name = target_col_name
self.reference_key = reference_key
self.noun_key = noun_key
def patterns(
self, text: Optional[dict], spacy_model: Optional[Language]
) -> List[List[dict]]:
morph = parse_reference(text[self.target_col_name][self.noun_key], spacy_model)
antecedent_feats = list(morph.values())
patterns = [
[{"TEXT": {"IN": text[self.target_col_name][self.reference_key].split() + text[self.target_col_name][self.noun_key].split()}}],
[{"POS": {"IN": ["NOUN", 'PROPN']}, "MORPH": {"IS_SUPERSET": antecedent_feats}}],
]
return patterns
To use custom constraints during the transformation, pass them into the custom_constraints
argument:
import pandas as pd
from rutransform.transformations import DatasetTransformer
from rutransform.utils.args import TransformArguments
# load data
dataset = pd.read_json('test_data/rwsd.json', lines=True)
# load arguments
transformation = "eda"
probability = 0.5
args = TransformArguments(transformation=transformation, probability=probability)
# init dataset transformer
tr = DatasetTransformer(
dataset=dataset,
text_col='text',
args=args,
custom_constraints=[
RWSDConstraint(
target_col_name='target', reference_key='span2_text', noun_key='span1_text'
)
],
return_type='pd' # format of the resulting dataset (default is 'hf')
)
# run transformation
output = tr.transform()
print('Target:', dataset['target'][0])
print('Original data:\n', dataset['text'][0],)
print('Transformed data:\n', output.transformed_dataset['text'][0])
Target: {'span1_text': 'статью', 'span2_text': 'читает ее', 'span1_index': 7, 'span2_index': 9}
Original data:
Сара взяла в библиотеке книгу, чтобы написать статью. Она читает ее, когда приходит с работы.
Transformed data:
Сара книгу , чтобы написать статью Она читает ее с работы .
Sentence Transformation
All of the transformations, supported by the framework, can be applied not only to the while datasets, but sentences alone.
from rutransform.transformations import (
SentenceAdditions,
ButterFingersTransformation,
EmojifyTransformation,
ChangeCharCase,
BackTranslationNER,
Paraphraser,
RandomEDA,
BAE
)
# initialize the transformations arguments, but you can leave out the transformation
args = TransformArguments(probability=0.5)
# transform the sentence
tr = SentenceAdditions(args=args)
tr.generate('мама мыла раму')
['мама мыла раму, Мама мыла раму,']
tr = ButterFingersTransformation(args=args,)
tr.generate('мама мыла раму')
['ммаа мырв ламу']
Custom Transformation
RuTransform allows one to create their own custom transformations. Here is the example of a simple transformation that randomises word order.
First, you need to define the SentenceOperation
class for the transformation, which has __init__
and generate
functions.
Note, that the function arguments must stay unchanged for further compatability with the framework. We also define a separate function for th transformation itself, to keep the code more readable.
import random
import spacy
from rutransform.transformations.utils import SentenceOperation
from typing import Optional, List, Union, Dict
def random_word_order(sentence, spacy_model, seed, max_outputs):
"""
Randomise word order
"""
random.seed(seed)
if not spacy_model:
spacy_model = spacy.load('ru_core_news_sm')
tokens = [token.text for token in spacy_model(sentence)]
return [' '.join(random.sample(tokens, k=len(tokens))) for _ in range(max_outputs)]
class RandomWordOrder(SentenceOperation):
def __init__(
self, args, seed=42,
max_outputs=1, device="cpu",
spacy_model=None,
):
super().__init__(
args=args,
seed=seed,
max_outputs=max_outputs,
device=device,
spacy_model=spacy_model,
)
def generate(
self,
sentence: str,
stop_words: Optional[List[Union[int, str]]] = None,
prob: Optional[float] = None,
) -> List[str]:
transformed = random_word_order(
sentence=sentence,
seed=self.seed,
spacy_model=self.spacy_model,
max_outputs=self.max_outputs
)
return transformed
Now the transformation is ready to use on the sentence level:
from rutransform.utils.args import TransformArguments
args = TransformArguments()
tr = RandomWordOrder(args=args, max_outputs=5)
tr.generate("мама мыла раму")
['раму мама мыла',
'раму мыла мама',
'мама раму мыла',
'раму мама мыла',
'мама раму мыла']
After creating the transformation, you can add it to an existing Transformer, by simply inheriting the class and changing the transform_info
fuction:
from rutransform.transformations import EDATransformer
class EDATransformer(EDATransformer):
def __init__(
self,
transformations: List[str],
task_type: str,
args: TransformArguments,
text_col: Optional[str] = "text",
label_col: Optional[str] = "label",
seed: int = 42,
device: str = "cpu",
constraints=None,
) -> None:
super().__init__(
transformations=transformations,
task_type=task_type,
args=args,
text_col=text_col,
label_col=label_col,
seed=seed,
device=device,
constraints=constraints
)
def transform_info() -> Dict[str, Optional[SentenceOperation]]:
info = {"eda": RandomEDA, "word_order": RandomWordOrder}
return info
...or create a Transformer from scratch by inheriting the Transformer
class and defining several functions:
transform_info
: a staticmethod, must return a dictionary {transformation name: corresponding SentenceOperation class}. It is used to load the list of all the available transformations_apply_transformation
: a function that applies the transformations to text until the transformed text passes the similarity threshold and returns a list of transformed texts and their similarity scorestransform
(optional): a function that takes a sentence as input and transforms it
For more information on the Transformer
class and its structure see here.
Once you have created the Transformer, add it to the rutransform/transformations/transformers folder and edit the __init__.py
file.
Now you transformation is ready for use!
Following the generally accepted typology (Zhang et al., 2020; Wang et al., 2021b), we divide the transformations included in the framework in two types, depending on their target.
Word-level perturbations utilize several strategies to perturb tokens, ranging from imitation of typos to synonym replacement:
Type | Transformation | Paper/Source | Main Idea | Example |
---|---|---|---|---|
Spelling | ButterFingers (butter_fingers ) |
(Dhole, 2021) | Adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance | This is a semtence ro test tje code |
Spelling | Case (case ) |
(Z. Wang, 2021) | Adds noise to data through case alteration | This is a sentence to tesT tHe cODe |
Modality | Emojify (emojify ) |
(Wang, 2021) | Replaces the input words with the corresponding emojis, preserving their original meaning | This is a sentence to test the 👩💻 |
Context | BAE (bae ) |
(Garg and Ramakrishnan, 2020) | Inserts and replaces tokens in the original text by masking a portion of the text and using masked LMs to generate contextually appropriate words | This is a sentence to test the given code |
In contrast to word-level perturbations, sentence-level perturbation techniques affect the syntactic structure:
Type | Transformation | Paper/Source | Main Idea | Example |
---|---|---|---|---|
Random | EDADELETE (eda ) |
(Wei and Zou, 2019) | Randomly deletes tokens in the text | This _ a sentence to test the code |
Random | EDASWAP (eda ) |
(Wei and Zou, 2019) | Randomly swaps tokens in the text | code is a sentence to test the This |
Paraphrasis | Paraphraser (paraphraser ) |
(Fenogenova, 2021) | Generates variations of the context via simple paraphrasing | I am just testing the code |
Paraphrasis | BackTranslation (back_translation ) |
(Yaseen and Langer, 2021) | Generates variations of the context through back-translation (ru -> en -> ru) | This sentence tests the code |
Distraction | AddSent (add_sent ) |
(Jia and Liang, 2017) | Generates extra words or a sentence at the end of the text. For multiple choice QA tasks it replaces one or more choice options with a generated one | This is a sentence to test the code , if you want to delete it |
The examples are given in English for illustration purposes.
The degree of the input modification can be controlled with an adversarial probability threshold, which serves as the hyperparameter. The higher the probability, the more the input gets modified.
The RuTransform's attacks and perturbations do not drastically change the input's meaning. Despite this, we consider the use of rule-based constraints that keep the linguistic structure and task-specific aspects unchanged. For instance, it is crucial to leave named entities in the QA tasks untouched and not modify the syntactic structure and anaphors when perturbing the coreference resolution task examples.
Name | Description | Additional Requirements | Example |
---|---|---|---|
Jeopardy |
Jeopardy type conatraints, including (1) Noun Phrases such as THIS FILM, THIS ACTOR, both UPPER and lower cased, (2) 'X', (3) «Named Entity in parentheses» | - | For the first time, THIS soda appeared in 1958 in Spain, the name of the drink is translated from the Esperanto language as “amazing”. |
NamedEntities |
Matches all the named entities in text | - | The singer from Turkey who impressed us all. |
Multihop |
Constraints for multihop QA tasks. Matches all the bridge and main answers important for hops | - | Question: Where is the source of the river, the tributary of which is the Getar, located? Supporting Text: The Getar is a river in Armenia. It originates in the Kotayk region, flows through the central part of Yerevan and flows into the Hrazdan. Main Text: The Hrazdan, a river in Armenia, is the left tributary of the Aras. It originates at the northwest extremity of Lake Sevan, near the city of Sevan. Answer: Sevan |
Referents |
Constraints for coreference resolution tasks. Matches (1) the anaphoric pronoun, (2) all possible antecedents (3) all verbs referring to antecedents and anaphor | Markup of the possible antecedents and anaphors | The singer from Turkey who impressed us all. |
We follow Wang et al., 2021 on filtering the adversarial examples with BERTScore (Zhang et al., 2019), a BERT-based text similarity metric (Devlin et al., 2019). We measure the semantic similarity between the original input and adversarial output and keep examples with the highest similarity score. In cases when the score is lower than a specified threshold, we iteratively decrease the adversarial probability threshold and re-score the new adversarial examples.
@article{taktasheva2022tape,
title={TAPE: Assessing Few-shot Russian Language Understanding},
author={Taktasheva, Ekaterina and Shavrina, Tatiana and Fenogenova, Alena and Shevelev, Denis and Katricheva, Nadezhda and Tikhonova, Maria and Akhmetgareeva, Albina and Zinkevich, Oleg and Bashmakova, Anastasiia and Iordanskaia, Svetlana and others},
journal={arXiv preprint arXiv:2210.12813},
year={2022}
}
All the code is available under the Apache 2.0 license.