---
layout: page
title: Getting Started
---
If you are looking to use the MuMiN dataset in your research or work, this page is here to get you started.
The dataset is built using our Python package, `mumin`. To install it, run
the following in your terminal:

```shell
$ pip install mumin
```
If you want to be able to add embeddings of the tweets and images, you need to
add on the `[embeddings]` extras:

```shell
$ pip install mumin[embeddings]
```
Further, if you're interested in exporting to the Deep Graph Library, then add
on the `[dgl]` extras:

```shell
$ pip install mumin[dgl]
```
You can add on all of the extras at once using `[all]`:

```shell
$ pip install mumin[all]
```
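To verify that the installation worked, you can look up the installed version from Python; this snippet only uses the standard library's `importlib.metadata`, so it makes no assumptions about the `mumin` API:

```python
>>> from importlib.metadata import version
>>> version('mumin')  # the installed version string
```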
With the `mumin` package installed, you compile the dataset in a Python script
containing the following:

```python
>>> from mumin import MuminDataset
>>> dataset = MuminDataset(bearer_token, size='small')
>>> dataset.compile()
MuminDataset(num_nodes=388,149, num_relations=475,490, size='small', compiled=True)
```
Compiling the dataset requires downloading data from Twitter, which in turn requires a Twitter API key. You can get one for free here. You will need the Bearer Token (`bearer_token`).
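Rather than hard-coding the token in your script, a common pattern is to read it from an environment variable. Here is a minimal sketch, where the variable name `TWITTER_BEARER_TOKEN` is just an example:

```python
>>> import os
>>> # Read the Bearer Token from an environment variable (example name)
>>> bearer_token = os.environ['TWITTER_BEARER_TOKEN']
>>> dataset = MuminDataset(bearer_token, size='small')
```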
Note that this dataset does not contain all the nodes and relations in
MuMiN-small, as compiling those would take much longer. The data left out are
the timelines, profile pictures and article images. These can be included by
specifying `include_extra_images=True` and/or `include_timelines=True` in the
constructor of `MuminDataset`, as sketched below.
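For instance, to compile the small dataset with both the timelines and the extra images included, the constructor call would look as follows (the keyword arguments are the ones documented above; everything else is as in the earlier example):

```python
>>> dataset = MuminDataset(bearer_token,
...                        size='small',
...                        include_timelines=True,
...                        include_extra_images=True)
>>> dataset.compile()
```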
With a compiled dataset, you can now work directly with the individual nodes
and relations using the `dataset.nodes` and `dataset.rels` dictionaries. For
instance, you can get a dataframe with all the claims as follows:
```python
>>> claim_df = dataset.nodes['claim']
>>> claim_df.columns
Index(['embedding', 'label', 'reviewers', 'date', 'language', 'keywords',
       'cluster_keywords', 'cluster', 'train_mask', 'val_mask', 'test_mask'],
      dtype='object')
```
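The `train_mask`, `val_mask` and `test_mask` columns mark the dataset splits. Assuming they are boolean flags, as their names suggest, you can split the claims with standard pandas boolean indexing:

```python
>>> # Split the claims into the three dataset splits
>>> train_claims = claim_df[claim_df['train_mask']]
>>> val_claims = claim_df[claim_df['val_mask']]
>>> test_claims = claim_df[claim_df['test_mask']]
```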
All the relations are dataframes with two columns, `src` and `tgt`,
corresponding to the source and target of the relation. For instance, if we're
interested in the relation `(:Tweet)-[:DISCUSSES]->(:Claim)`, then we can
extract it as follows:
```python
>>> discusses_df = dataset.rels[('tweet', 'discusses', 'claim')]
>>> discusses_df.head()
   src  tgt
0    0    0
1    1    1
2    2    1
3    3    1
4    4    1
```
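Assuming the `src` and `tgt` values are positional indices into the corresponding node dataframes, you can look up the endpoints of a relation with pandas `iloc`; a minimal sketch:

```python
>>> tweet_df = dataset.nodes['tweet']
>>> # Look up the endpoints of the first (:Tweet)-[:DISCUSSES]->(:Claim) edge
>>> row = discusses_df.iloc[0]
>>> tweet = tweet_df.iloc[row.src]  # the tweet doing the discussing
>>> claim = claim_df.iloc[row.tgt]  # the claim being discussed
```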
If you are interested in computing transformer embeddings of the tweets and
images, then run the following:

```python
>>> dataset.add_embeddings()
MuminDataset(num_nodes=388,149, num_relations=475,490, size='small', compiled=True)
```
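The embeddings end up in the `embedding` columns of the node dataframes, such as the claim dataframe shown above. Assuming each entry in that column is a fixed-length NumPy vector, you can stack them into a single matrix:

```python
>>> import numpy as np
>>> # Stack the per-claim embedding vectors into a (num_claims, dim) matrix
>>> claim_embeddings = np.stack(claim_df['embedding'].to_numpy())
```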
From a compiled dataset, with or without embeddings, you can export it to a Deep Graph Library heterogeneous graph object, which allows you to apply graph machine learning algorithms to the data. To export it, you simply run:
```python
>>> dgl_graph = dataset.to_dgl()
>>> type(dgl_graph)
dgl.heterograph.DGLHeteroGraph
```
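The resulting heterogeneous graph can be inspected with the standard DGL API, for instance:

```python
>>> dgl_graph.ntypes              # node types, e.g. ['claim', 'tweet', ...]
>>> dgl_graph.etypes              # relation types, e.g. ['discusses', ...]
>>> dgl_graph.num_nodes('claim')  # number of claim nodes
```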
We have created a tutorial which takes you through the dataset and shows how one can build several kinds of misinformation classifiers on it. The tutorial can be found here: https://colab.research.google.com/drive/1JCjgg3moGBOuZk4iVjBpQNqgsAYFyNoS