This part of the tutorial shows how you can load a corpus for training a model. We assume that you're familiar with the base types of this library.
The Corpus represents a dataset that you use to train a model. It consists of a list of train sentences, a list of dev sentences, and a list of test sentences, which correspond to the training, validation and testing splits during model training.
The following example snippet instantiates the Universal Dependency Treebank for English as a corpus object:
import flair.datasets
corpus = flair.datasets.UD_ENGLISH()
The first time you run this snippet, it downloads the Universal Dependency Treebank for English onto your hard drive. It then reads the train, test and dev splits into the Corpus object that it returns. Check the length of the three splits to see how many Sentences each one contains:
# print the number of Sentences in the train split
print(len(corpus.train))
# print the number of Sentences in the test split
print(len(corpus.test))
# print the number of Sentences in the dev split
print(len(corpus.dev))
You can also access the Sentence objects in each split directly. For instance, let us look at the first Sentence in the test split of the English UD:
# print the first Sentence in the test split
print(corpus.test[0])
This prints:
Sentence: "What if Google Morphed Into GoogleOS ?" - 7 Tokens
The sentence is fully tagged with syntactic and morphological information. For instance, print the sentence with PoS tags:
# print the first Sentence in the test split with PoS tags
print(corpus.test[0].to_tagged_string('pos'))
This should print:
What <WP> if <IN> Google <NNP> Morphed <VBD> Into <IN> GoogleOS <NNP> ? <.>
So the corpus is tagged and ready for training.
A Corpus contains a number of useful helper functions. For instance, you can downsample the data by calling downsample() and passing a ratio. So, instead of loading the full corpus with flair.datasets.UD_ENGLISH() as above, you can load and downsample it in one step:
import flair.datasets
downsampled_corpus = flair.datasets.UD_ENGLISH().downsample(0.1)
If you print both corpora, you see that the second one has been downsampled to 10% of the data.
print("--- 1 Original ---")
print(corpus)
print("--- 2 Downsampled ---")
print(downsampled_corpus)
This should print:
--- 1 Original ---
Corpus: 12543 train + 2002 dev + 2077 test sentences
--- 2 Downsampled ---
Corpus: 1255 train + 201 dev + 208 test sentences
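To make the effect of the ratio concrete, here is a minimal plain-Python sketch of downsampling a set of splits. It is not Flair's implementation: the helper name `downsample_split`, the fixed seed and the rounding rule are assumptions chosen for illustration, so the exact counts may differ from those Flair reports.

```python
import random


def downsample_split(sentences, ratio, seed=0):
    """Return a random subset holding roughly `ratio` of the sentences."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    if not sentences:
        return []
    # The rounding rule is an assumption; keep at least one item per split.
    sample_size = max(1, round(len(sentences) * ratio))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(sentences, sample_size)


# Toy splits standing in for corpus.train, corpus.dev and corpus.test
splits = {
    "train": [f"train sentence {i}" for i in range(100)],
    "dev": [f"dev sentence {i}" for i in range(20)],
    "test": [f"test sentence {i}" for i in range(20)],
}

downsampled = {name: downsample_split(data, 0.1) for name, data in splits.items()}
for name, data in downsampled.items():
    print(name, len(data))
```

Each split is sampled independently, so the proportions between train, dev and test stay roughly the same after downsampling.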
For many learning tasks you need to create a target dictionary. The Corpus therefore lets you create a tag or label dictionary, depending on the task you want to learn. Simply execute the following code snippet to do so:
# create tag dictionary for a PoS task
corpus = flair.datasets.UD_ENGLISH()
print(corpus.make_tag_dictionary('upos'))
# create tag dictionary for an NER task
corpus = flair.datasets.CONLL_03_DUTCH()
print(corpus.make_tag_dictionary('ner'))
# create label dictionary for a text classification task
corpus = flair.datasets.TREC_6()
print(corpus.make_label_dictionary())
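Conceptually, such a dictionary maps every distinct tag or label seen in the data to an integer index that a model can predict. The following plain-Python sketch shows the idea only; the function name `build_label_dictionary` and the toy tag list are hypothetical, and Flair's own dictionary object may store additional entries.

```python
def build_label_dictionary(labels):
    """Map each distinct label to an integer index, in order of first appearance."""
    label_to_index = {}
    for label in labels:
        if label not in label_to_index:
            label_to_index[label] = len(label_to_index)
    return label_to_index


# Hypothetical PoS tags collected from a tiny training split
tags = ["NNP", "VBD", "IN", "NNP", "."]
tag_dictionary = build_label_dictionary(tags)
print(tag_dictionary)  # {'NNP': 0, 'VBD': 1, 'IN': 2, '.': 3}
```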
Another useful function is obtain_statistics(), which returns a Python dictionary with useful statistics about your dataset. Using it, for example, on the TREC-6 dataset like this
import flair.datasets
corpus = flair.datasets.TREC_6()
stats = corpus.obtain_statistics()
print(stats)
outputs detailed information on the dataset, each split, and the distribution of class labels.
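To give a rough idea of what such statistics look like, the sketch below computes the size, label distribution and average token count of a single split in plain Python. The dictionary keys, the helper name `split_statistics` and the toy examples are assumptions made for this illustration and are not necessarily the keys that obtain_statistics() returns.

```python
from collections import Counter


def split_statistics(examples):
    """Summarize one split given as a list of (tokens, label) pairs."""
    token_counts = [len(tokens) for tokens, _ in examples]
    return {
        "total_size": len(examples),
        "label_distribution": dict(Counter(label for _, label in examples)),
        "avg_tokens": sum(token_counts) / len(token_counts) if token_counts else 0.0,
    }


# Made-up toy data standing in for a question classification training split
train_examples = [
    (["What", "is", "Flair", "?"], "DESC"),
    (["Who", "wrote", "it", "?"], "HUM"),
    (["What", "is", "a", "corpus", "?"], "DESC"),
]

toy_stats = {"TRAIN": split_statistics(train_examples)}
print(toy_stats)
```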
If you want to train multiple tasks at once, you can use the MultiCorpus object. To initialize the MultiCorpus, you first need to create any number of Corpus objects. Afterwards, you can pass a list of Corpus objects to the MultiCorpus constructor. For instance, the following snippet loads a combined corpus consisting of the English, German and Dutch Universal Dependency Treebanks.
import flair.datasets
from flair.data import MultiCorpus
# load the three individual corpora
english_corpus = flair.datasets.UD_ENGLISH()
german_corpus = flair.datasets.UD_GERMAN()
dutch_corpus = flair.datasets.UD_DUTCH()
# make a multi corpus consisting of three UDs
multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])
The MultiCorpus inherits from Corpus, so you can use it like any other corpus to train your models.
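One way to picture a combined corpus is as the pooled train, dev and test splits of its members. The sketch below illustrates that idea with a hypothetical `ToyCorpus` class in plain Python; it makes no claim about how MultiCorpus is implemented internally.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyCorpus:
    """Hypothetical stand-in for a corpus: three lists of sentences."""
    train: List[str] = field(default_factory=list)
    dev: List[str] = field(default_factory=list)
    test: List[str] = field(default_factory=list)


def combine(corpora):
    """Pool the splits of several corpora into one new corpus."""
    combined = ToyCorpus()
    for corpus in corpora:
        combined.train.extend(corpus.train)
        combined.dev.extend(corpus.dev)
        combined.test.extend(corpus.test)
    return combined


english = ToyCorpus(train=["Hello world", "Good morning"], dev=["Hi"], test=["Bye"])
german = ToyCorpus(train=["Hallo Welt"], dev=["Guten Tag"], test=["Auf Wiedersehen"])

pooled = combine([english, german])
print(len(pooled.train), len(pooled.dev), len(pooled.test))  # 3 2 2
```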
Flair supports a growing list of prepared datasets out of the box. That is, it automatically downloads and sets up the data the first time you call the corresponding constructor ID. The following datasets are supported:
ID(s) | Languages | Description |
---|---|---|
'CONLL_2000' | English | CoNLL-2000 syntactic chunking |
ID(s) | Languages | Description |
---|---|---|
'CONLL_03_DUTCH' | Dutch | CoNLL-03 4-class NER |
'CONLL_03_SPANISH' | Spanish | CoNLL-03 4-class NER |
'WNUT_17' | English | WNUT-17 emerging entity detection |
'WIKINER_ENGLISH' | English | WikiNER NER dataset automatically generated from Wikipedia |
'WIKINER_GERMAN' | German | WikiNER NER dataset automatically generated from Wikipedia |
'WIKINER_FRENCH' | French | WikiNER NER dataset automatically generated from Wikipedia |
'WIKINER_ITALIAN' | Italian | WikiNER NER dataset automatically generated from Wikipedia |
'WIKINER_SPANISH' | Spanish | WikiNER NER dataset automatically generated from Wikipedia |
'WIKINER_PORTUGUESE' | Portuguese | WikiNER NER dataset automatically generated from Wikipedia |
'WIKINER_POLISH' | Polish | WikiNER NER dataset automatically generated from Wikipedia |
'WIKINER_RUSSIAN' | Russian | WikiNER NER dataset automatically generated from Wikipedia |
'NER_BASQUE' | Basque | NER dataset for Basque |
ID(s) | Languages | Description |
---|---|---|
'UD_ARABIC' | Arabic | Universal Dependency Treebank for Arabic |
'UD_BASQUE' | Basque | Universal Dependency Treebank for Basque |
'UD_BULGARIAN' | Bulgarian | Universal Dependency Treebank for Bulgarian |
'UD_CATALAN' | Catalan | Universal Dependency Treebank for Catalan |
'UD_CHINESE' | Chinese | Universal Dependency Treebank for Chinese |
'UD_CROATIAN' | Croatian | Universal Dependency Treebank for Croatian |
'UD_CZECH' | Czech | Very large Universal Dependency Treebank for Czech |
'UD_DANISH' | Danish | Universal Dependency Treebank for Danish |
'UD_DUTCH' | Dutch | Universal Dependency Treebank for Dutch |
'UD_ENGLISH' | English | Universal Dependency Treebank for English |
'UD_FINNISH' | Finnish | Universal Dependency Treebank for Finnish |
'UD_FRENCH' | French | Universal Dependency Treebank for French |
'UD_GERMAN' | German | Universal Dependency Treebank for German |
'UD_GERMAN_HDT' | German | Very large Universal Dependency Treebank for German |
'UD_HEBREW' | Hebrew | Universal Dependency Treebank for Hebrew |
'UD_HINDI' | Hindi | Universal Dependency Treebank for Hindi |
'UD_INDONESIAN' | Indonesian | Universal Dependency Treebank for Indonesian |
'UD_ITALIAN' | Italian | Universal Dependency Treebank for Italian |
'UD_JAPANESE' | Japanese | Universal Dependency Treebank for Japanese |
'UD_KOREAN' | Korean | Universal Dependency Treebank for Korean |
'UD_NORWEGIAN' | Norwegian | Universal Dependency Treebank for Norwegian |
'UD_PERSIAN' | Persian / Farsi | Universal Dependency Treebank for Persian |
'UD_POLISH' | Polish | Universal Dependency Treebank for Polish |
'UD_PORTUGUESE' | Portuguese | Universal Dependency Treebank for Portuguese |
'UD_ROMANIAN' | Romanian | Universal Dependency Treebank for Romanian |
'UD_RUSSIAN' | Russian | Universal Dependency Treebank for Russian |
'UD_SERBIAN' | Serbian | Universal Dependency Treebank for Serbian |
'UD_SLOVAK' | Slovak | Universal Dependency Treebank for Slovak |
'UD_SLOVENIAN' | Slovenian | Universal Dependency Treebank for Slovenian |
'UD_SPANISH' | Spanish | Universal Dependency Treebank for Spanish |
'UD_SWEDISH' | Swedish | Universal Dependency Treebank for Swedish |
'UD_TURKISH' | Turkish | Universal Dependency Treebank for Turkish |
ID(s) | Languages | Description |
---|---|---|
'IMDB' | English | IMDB dataset of movie reviews and sentiment |
'NEWSGROUPS' | English | The popular 20 newsgroups classification dataset |
'TREC_6', 'TREC_50' | English | The TREC question classification dataset |
ID(s) | Languages | Description |
---|---|---|
'WASSA_ANGER' | English | The WASSA emotion-intensity detection challenge (anger) |
'WASSA_FEAR' | English | The WASSA emotion-intensity detection challenge (fear) |
'WASSA_JOY' | English | The WASSA emotion-intensity detection challenge (joy) |
'WASSA_SADNESS' | English | The WASSA emotion-intensity detection challenge (sadness) |
ID(s) | Languages | Description |
---|---|---|
'FeideggerCorpus' | German | Feidegger dataset of fashion images and German-language descriptions |
'OpusParallelCorpus' | Any language pair | Parallel corpora of the OPUS project, currently supports only the Tatoeba corpus |
So to load the IMDB corpus for sentiment text classification, simply do:
import flair.datasets
corpus = flair.datasets.IMDB()
This downloads and sets up everything you need to train your model.
In case you want to train over a sequence labeling dataset that is not in the above list, you can load it with the ColumnCorpus object. Most sequence labeling datasets in NLP use some sort of column format in which each line is a word and each column is one level of linguistic annotation. See for instance these two sentences:
George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC

Sam N B-PER
Houston N I-PER
stayed V O
home N O
The first column is the word itself, the second holds coarse PoS tags, and the third holds BIO-annotated NER tags. An empty line separates sentences. To read such a dataset, define the column structure as a dictionary and instantiate a ColumnCorpus.
from flair.data import Corpus
from flair.datasets import ColumnCorpus
# define columns
columns = {0: 'text', 1: 'pos', 2: 'ner'}
# this is the folder in which train, test and dev files reside
data_folder = '/path/to/data/folder'
# init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt')
This gives you a Corpus object that contains the train, dev and test splits, each of which is a list of Sentence objects.
So, to check how many sentences there are in the training split, do
len(corpus.train)
You can also access a sentence and check out its annotations. Let's assume that the training split is read from the example above; then executing these commands
print(corpus.train[0].to_tagged_string('ner'))
print(corpus.train[1].to_tagged_string('pos'))
will print the sentences with different layers of annotation:
George <B-PER> Washington <I-PER> went to Washington <B-LOC>
Sam <N> Houston <N> stayed <V> home <N>
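To make the column format concrete, the following plain-Python sketch parses the example above into sentences and renders one annotation layer at a time. It is an illustration only and not how ColumnCorpus works internally; the helper names `read_column_data` and `render_tagged` are hypothetical, and leaving out `O` tags in the rendering is an assumption made to match the example output.

```python
def read_column_data(text, columns):
    """Parse column-formatted text into sentences.

    Each sentence is a list of token dicts keyed by the names in `columns`.
    An empty line ends the current sentence.
    """
    sentences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            if current:
                sentences.append(current)
                current = []
            continue
        current.append({name: fields[index] for index, name in columns.items()})
    if current:  # the last sentence may not be followed by an empty line
        sentences.append(current)
    return sentences


def render_tagged(sentence, layer):
    """Render tokens followed by their tag in `layer`, skipping 'O' tags."""
    parts = []
    for token in sentence:
        parts.append(token["text"])
        if token[layer] != "O":
            parts.append(f"<{token[layer]}>")
    return " ".join(parts)


data = """George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC

Sam N B-PER
Houston N I-PER
stayed V O
home N O
"""

columns = {0: "text", 1: "pos", 2: "ner"}
sentences = read_column_data(data, columns)
print(render_tagged(sentences[0], "ner"))
print(render_tagged(sentences[1], "pos"))
```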
If you want to use your own text classification dataset, there are currently two methods to go about this: load specified text and labels from a simple CSV file or format your data to the FastText format.
Many text classification datasets are distributed as simple CSV files in which each row corresponds to a data point and columns correspond to text, labels, and other metadata. You can load a CSV format classification dataset using CSVClassificationCorpus by passing in a column format (like in ColumnCorpus above). This column format indicates which column(s) in the CSV hold the text and which column(s) hold the label(s). By default, Python's CSV library assumes that your files are in Excel CSV format, but you can specify additional parameters if you use custom delimiters or quote characters.
Note: You need to save your split CSV data files in the data_folder path with each file named appropriately, i.e. train.csv, test.csv and dev.csv. This is because the corpus initializers automatically search for the train, dev and test splits in a folder.
from flair.data import Corpus
from flair.datasets import CSVClassificationCorpus
# this is the folder in which train, test and dev files reside
data_folder = '/path/to/data'
# column format indicating which columns hold the text and label(s)
column_name_map = {4: "text", 1: "label_topic", 2: "label_subtopic"}
# load corpus containing training, test and dev data; if the CSV has a header, you can skip it
corpus: Corpus = CSVClassificationCorpus(data_folder,
                                         column_name_map,
                                         skip_header=True,
                                         delimiter='\t',  # tab-separated files
                                         )
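The role of the column map can be illustrated with Python's standard csv module. The sketch below is not the CSVClassificationCorpus implementation: the helper name `read_classification_csv`, the toy row, and the rule that every mapped name starting with `label` counts as a label are assumptions that mirror the example map above.

```python
import csv
import io


def read_classification_csv(handle, column_name_map, skip_header=False, **fmtparams):
    """Read (text, labels) pairs from a CSV file object using a column map."""
    reader = csv.reader(handle, **fmtparams)
    if skip_header:
        next(reader, None)  # discard the header row if there is one
    examples = []
    for row in reader:
        text_parts, labels = [], []
        for index, name in sorted(column_name_map.items()):
            if name == "text":
                text_parts.append(row[index])
            elif name.startswith("label"):
                labels.append(row[index])
        examples.append((" ".join(text_parts), labels))
    return examples


# A tiny tab-separated file with a header row, held in memory for the demo
raw = (
    "id\ttopic\tsubtopic\tdate\ttext\n"
    "1\tsports\tfootball\t2019\tGreat match today\n"
)
column_name_map = {4: "text", 1: "label_topic", 2: "label_subtopic"}

examples = read_classification_csv(
    io.StringIO(raw), column_name_map, skip_header=True, delimiter="\t"
)
print(examples)  # [('Great match today', ['sports', 'football'])]
```

Extra keyword arguments such as `delimiter` or `quotechar` are passed straight to `csv.reader`, which is how custom delimiters and quote characters can be handled with the standard library.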
If using CSVClassificationCorpus is not practical, you may convert your data to the FastText format, in which each line in the file represents a text document. A document can have one or multiple labels that are defined at the beginning of the line, each starting with the prefix __label__. This looks like this:
__label__<label_1> <text>
__label__<label_1> __label__<label_2> <text>
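As a small illustration of this format, the plain-Python sketch below splits one line into its labels and its text. The function name `parse_fasttext_line` and the sample lines are made up for this example; it is not the parser that Flair uses.

```python
LABEL_PREFIX = "__label__"


def parse_fasttext_line(line):
    """Split a FastText-formatted line into (labels, text)."""
    words = line.split()
    labels = []
    position = 0
    # Labels are the leading words that start with the prefix.
    while position < len(words) and words[position].startswith(LABEL_PREFIX):
        labels.append(words[position][len(LABEL_PREFIX):])
        position += 1
    return labels, " ".join(words[position:])


single = parse_fasttext_line("__label__positive A wonderful film")
multi = parse_fasttext_line("__label__sports __label__news The match ended in a draw")
print(single)  # (['positive'], 'A wonderful film')
print(multi)   # (['sports', 'news'], 'The match ended in a draw')
```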
As previously mentioned, to create a Corpus for a text classification task, you need to have three files (train, dev, and test) in the above format located in one folder. This data folder structure could, for example, look like this for the IMDB task:
/resources/tasks/imdb/train.txt
/resources/tasks/imdb/dev.txt
/resources/tasks/imdb/test.txt
Now create a ClassificationCorpus by pointing to this folder (/resources/tasks/imdb). Each line in a file is then converted to a Sentence object annotated with the labels.
Attention: A text in a line can have multiple sentences. Thus, a Sentence object can actually consist of multiple sentences.
from flair.data import Corpus
from flair.datasets import ClassificationCorpus
# this is the folder in which train, test and dev files reside
data_folder = '/path/to/data/folder'
# load corpus containing training, test and dev data
corpus: Corpus = ClassificationCorpus(data_folder,
                                      test_file='test.txt',
                                      dev_file='dev.txt',
                                      train_file='train.txt')
Note again that our corpus initializers have methods to automatically look for train, dev and test splits in a folder. So in most cases you don't need to specify the file names yourself. Often, this is enough:
# this is the folder in which train, test and dev files reside
data_folder = '/path/to/data/folder'
# load corpus by pointing to folder. Train, dev and test get identified automatically.
corpus: Corpus = ClassificationCorpus(data_folder)
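The automatic lookup can be pictured as a scan of the folder for files whose names indicate a split. The sketch below is only a guess at such a rule, written in plain Python: the helper name `find_split_files` and the matching by substring (`train`, `dev`, `test`) are assumptions for illustration, and the rules Flair actually applies may differ.

```python
import tempfile
from pathlib import Path


def find_split_files(data_folder):
    """Return the first file found for each split, matched by file name."""
    found = {"train": None, "dev": None, "test": None}
    for path in sorted(Path(data_folder).iterdir()):
        if not path.is_file():
            continue
        for split in found:
            # Assumed rule: the split name appears somewhere in the file name.
            if found[split] is None and split in path.name:
                found[split] = path
                break
    return found


# Build a throwaway folder with three split files to try the lookup on
with tempfile.TemporaryDirectory() as folder:
    for name in ("train.txt", "dev.txt", "test.txt"):
        (Path(folder) / name).write_text("__label__pos example text\n")
    detected = {
        split: (path.name if path else None)
        for split, path in find_split_files(folder).items()
    }

print(detected)  # {'train': 'train.txt', 'dev': 'dev.txt', 'test': 'test.txt'}
```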
You can now look into training your own models.