Getting Started

If you are looking to use the MuMiN dataset in your research or work, this page is here to get you started.

Installation of the mumin package

The dataset is built using our Python package, mumin. To install this, write the following in your terminal:

$ pip install mumin

If you want to be able to add embeddings of the tweets and images, you need to add on the [embeddings] extras:

$ pip install mumin[embeddings]

Further, if you're interested in exporting to the Deep Graph Library, then add on the [dgl] extras:

$ pip install mumin[dgl]

To install all the extras at once, use [all]:

$ pip install mumin[all]

Compiling and using the MuMiN dataset

With the mumin package installed, you compile the dataset in a Python script containing the following:

>>> from mumin import MuminDataset
>>> dataset = MuminDataset(bearer_token, size='small')
>>> dataset.compile()
MuminDataset(num_nodes=388,149, num_relations=475,490, size='small', compiled=True)

Compiling the dataset downloads data from Twitter, which requires a Twitter API key. You can get one for free here. The bearer_token argument passed to MuminDataset is the Bearer Token belonging to that key.

Note that this dataset does not contain all the nodes and relations in MuMiN-small, as that would take considerably longer to compile. The data left out are timelines, profile pictures and article images. These can be included by specifying include_extra_images=True and/or include_timelines=True in the constructor of MuminDataset.
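To keep the Bearer Token out of your source code, one option is to read it from an environment variable before constructing the dataset. The following is a minimal sketch, not part of the mumin package: the variable name TWITTER_BEARER_TOKEN and the helpers load_bearer_token and compile_mumin are hypothetical names chosen for illustration. Only MuminDataset, its size argument, the include_extra_images and include_timelines options and compile() come from this page.

```python
import os


def load_bearer_token(env_var: str = "TWITTER_BEARER_TOKEN") -> str:
    """Read the Twitter Bearer Token from an environment variable.

    The variable name is a hypothetical convention, not something mumin requires.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Twitter Bearer Token."
        )
    return token


def compile_mumin(size: str = "small", **options):
    """Compile MuMiN using the token from the environment.

    Extra keyword arguments (for example include_timelines=True) are passed
    straight to the MuminDataset constructor.
    """
    # Imported lazily so this helper module loads even without mumin installed.
    from mumin import MuminDataset

    dataset = MuminDataset(load_bearer_token(), size=size, **options)
    dataset.compile()
    return dataset
```

With the variable set in your shell, calling compile_mumin(include_timelines=True, include_extra_images=True) would compile the small dataset with the extra data included, at the cost of a longer compile time.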

With a compiled dataset, you can now work directly with the individual nodes and relations using the dataset.nodes and dataset.rels dictionaries. For instance, you can get a dataframe with all the claims as follows:

>>> claim_df = dataset.nodes['claim']
>>> claim_df.columns
Index(['embedding', 'label', 'reviewers', 'date', 'language', 'keywords',
       'cluster_keywords', 'cluster', 'train_mask', 'val_mask', 'test_mask'],
      dtype='object')
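The train_mask, val_mask and test_mask columns can be used to split the claims into the predefined dataset splits. The sketch below uses a small toy dataframe in place of dataset.nodes['claim'] so that it runs without compiling the dataset. It assumes the mask columns hold boolean values, and the label values shown are illustrative only. The helper split_by_mask is a hypothetical name, not part of mumin.

```python
import pandas as pd

# Toy stand-in for dataset.nodes['claim'], reduced to the columns used here.
# The values are made up for illustration.
claim_df = pd.DataFrame(
    {
        "label": ["misinformation", "factual", "misinformation", "misinformation"],
        "train_mask": [True, True, False, False],
        "val_mask": [False, False, True, False],
        "test_mask": [False, False, False, True],
    }
)


def split_by_mask(df: pd.DataFrame) -> dict:
    """Return one dataframe per split, selected with the boolean mask columns."""
    return {
        name: df[df[f"{name}_mask"].astype(bool)]
        for name in ("train", "val", "test")
    }


splits = split_by_mask(claim_df)
for name, part in splits.items():
    print(name, len(part))
```

On a compiled dataset the same helper could be applied to dataset.nodes['claim'] directly, provided the mask columns behave as assumed here.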

All the relations are dataframes with two columns, src and tgt, corresponding to the source and target of the relation. For instance, if we're interested in the relation (:Tweet)-[:DISCUSSES]->(:Claim) then we can extract this as follows:

>>> discusses_df = dataset.rels[('tweet', 'discusses', 'claim')]
>>> discusses_df.head()
   src  tgt
0    0    0
1    1    1
2    2    1
3    3    1
4    4    1
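A relation dataframe can be combined with the node dataframes it connects. The sketch below counts how many tweets discuss each claim, using toy data that mirrors the head() output above. It assumes that src and tgt are row indices into the corresponding node dataframes (here, that tgt indexes dataset.nodes['claim']); check this against your compiled dataset before relying on it. The claim labels are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for dataset.rels[('tweet', 'discusses', 'claim')]
# and dataset.nodes['claim'].
discusses_df = pd.DataFrame({"src": [0, 1, 2, 3, 4], "tgt": [0, 1, 1, 1, 1]})
claim_df = pd.DataFrame({"label": ["factual", "misinformation"]})

# Count the number of tweets pointing at each claim.
tweets_per_claim = discusses_df.groupby("tgt").size().rename("num_tweets")

# Attach the counts to the claims by index; claims without tweets get 0.
claims_with_counts = (
    claim_df.join(tweets_per_claim)
    .fillna({"num_tweets": 0})
    .astype({"num_tweets": int})
)
print(claims_with_counts)
```

The same groupby-and-join pattern should carry over to other relations, as long as the index assumption holds for them.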

Using embeddings and exporting to dgl

If you are interested in computing transformer embeddings of the tweets and images then run the following:

>>> dataset.add_embeddings()
MuminDataset(num_nodes=388,149, num_relations=475,490, size='small', compiled=True)

From a compiled dataset, with or without embeddings, you can export the dataset to a Deep Graph Library heterogeneous graph object, which allows you to use graph machine learning algorithms on the dataset. To export it, you simply run:

>>> dgl_graph = dataset.to_dgl()
>>> type(dgl_graph)
dgl.heterograph.DGLHeteroGraph

Tutorial

We have created a tutorial that walks you through the dataset and shows how to build several kinds of misinformation classifiers on it. The tutorial can be found here: https://colab.research.google.com/drive/1JCjgg3moGBOuZk4iVjBpQNqgsAYFyNoS