Signal-1M-Tools

What is the Signal 1M Dataset?

The Signal Media One-Million News Articles Dataset was released by Signal Media to facilitate research on news articles. It can be used for submissions to the NewsIR'16 workshop, but it is intended to serve the community for research on news retrieval in general.

The articles were originally collected by Moreover Technologies (one of Signal's content providers) from a variety of news sources over a period of one month (1-30 September 2015). The dataset contains 1 million articles that are mainly English, but it also includes non-English and multilingual articles. Sources include major outlets, such as Reuters, as well as local news sources and blogs.

Getting Started

Downloading the dataset

To obtain the dataset, follow the download link here.

Elasticsearch

Elasticsearch is a powerful distributed RESTful search engine that can be used to store and index large amounts of data. At Signal, we use Elasticsearch to handle most of our search requests.

Installation

Download Elasticsearch and unzip it.
Run bin/elasticsearch on Unix or bin/elasticsearch.bat on Windows.
Run curl -X GET http://localhost:9200/ to check that the server responds.

At this point, Elasticsearch should be running locally on port 9200. More information about Elasticsearch can be found at their GitHub page.
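The same check can be scripted. The following is a minimal sketch using only the Python standard library; the function names cluster_info and describe are illustrative and not part of this repository, and the sketch assumes an unsecured local node that returns the usual JSON document (with cluster_name and version.number fields) at its root URL.

```python
import json
import urllib.request


def cluster_info(base_url="http://localhost:9200", timeout=5):
    """Fetch and parse the JSON document served at the Elasticsearch root URL."""
    url = base_url.rstrip("/") + "/"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def describe(info):
    """Summarise the root response; missing fields fall back to 'unknown'."""
    version = info.get("version", {}).get("number", "unknown")
    cluster = info.get("cluster_name", "unknown")
    return "Elasticsearch {} (cluster: {})".format(version, cluster)


if __name__ == "__main__":
    # Raises urllib.error.URLError if nothing is listening on the port.
    print(describe(cluster_info()))
```

If the script raises a connection error, Elasticsearch is most likely not running or is listening on a different port.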

We advise that you use a tool to interact with Elasticsearch, such as Sense, which is used in the example below.

Creating an index

To store articles, you first need to create an index. Create one named articles:

curl -X PUT 'http://localhost:9200/articles'

or in Sense:

PUT articles
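The same request can also be sent from Python. This is a minimal sketch using only the standard library; index_url and create_index are hypothetical helper names, and the default URL and index name simply mirror the curl command above.

```python
import urllib.error
import urllib.request


def index_url(base_url, index_name):
    """Build the URL of an index, tolerating a trailing slash on the base URL."""
    return "{}/{}".format(base_url.rstrip("/"), index_name)


def create_index(base_url="http://localhost:9200", index_name="articles"):
    """Send PUT <base_url>/<index_name> and return (HTTP status, response body)."""
    request = urllib.request.Request(index_url(base_url, index_name), method="PUT")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # Elasticsearch replies with an error status if, for example,
        # the index already exists; return that reply instead of raising.
        return error.code, error.read().decode("utf-8")


if __name__ == "__main__":
    status, body = create_index()
    print(status, body)
```

Returning the status and body rather than raising makes it easy to re-run the script against a node where the index has already been created.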

Indexing the million articles

To index the million articles into Elasticsearch using Python, first install Requests:

pip install requests

Then run:

python index_articles.py http://localhost:9200 ./million.jsonl
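The internals of index_articles.py are not reproduced here. As an illustration of one common way to load a JSON Lines file into Elasticsearch, the sketch below builds request bodies for the bulk API. It is not the script's actual implementation: the helper names, the use of each article's "id" field as the document ID, and the optional doc_type argument (older Elasticsearch versions expect a _type in the action line) are all assumptions.

```python
import json


def read_jsonl(path):
    """Yield one parsed article per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def bulk_body(articles, index_name="articles", doc_type=None):
    """Build a newline-delimited bulk body: one action line, then one source line."""
    lines = []
    for article in articles:
        meta = {"_index": index_name}
        if doc_type is not None:
            meta["_type"] = doc_type  # assumption: only needed on older versions
        if "id" in article:
            meta["_id"] = article["id"]  # assumption: reuse the article's own id
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps(article))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def batches(items, size=500):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each body produced by bulk_body would then be sent with a POST to the _bulk endpoint of the server, for example with Requests. Sending the articles in batches keeps each request small; the batch size of 500 is an arbitrary starting point.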

Term frequencies

The term and document frequencies are also available using these links. These values were calculated after routine tokenisation and stop-word removal.

These files are in EDN (Extensible Data Notation) format.
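To make the distinction between the two kinds of counts concrete, here is a small sketch of how term and document frequencies can be computed. The tokeniser and the tiny stop-word list are assumptions for illustration only; the exact tokenisation and stop-word list behind the published files are not specified here.

```python
import re
from collections import Counter

# Illustrative stop-word list only; not the list used for the published files.
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "on", "the", "to"})


def tokenise(text):
    """Lower-case the text, split on non-alphanumerics, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def frequencies(documents):
    """Return (term_freq, doc_freq) Counters for an iterable of texts.

    term_freq counts every occurrence of a term across the collection;
    doc_freq counts the number of documents that contain the term.
    """
    term_freq = Counter()
    doc_freq = Counter()
    for text in documents:
        tokens = tokenise(text)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # each document counts once per term
    return term_freq, doc_freq
```

For example, a term that appears twice in one article and once in another has a term frequency of 3 and a document frequency of 2.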

TREC

Signal-1M-Convert-To-TREC

A script to convert the Signal Media One-Million News Articles Dataset to TREC format. The TREC format allows researchers to index the dataset using popular Information Retrieval platforms such as Terrier (http://terrier.org).

Running the script

After obtaining the dataset through this form (http://goo.gl/forms/5i4KldoWIX), extract the JSONL file from the downloaded Gzip file. Then run the script like this:

python convert-to-trec.py -i <path to signalmedia-1m.jsonl> -o <path to your outputfile>
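For readers unfamiliar with the format, the sketch below shows roughly what such a conversion involves: each JSON article becomes one DOC element with a DOCNO identifier. It is not the code of convert-to-trec.py; the tag layout, the field names "id", "title" and "content", and the choice to replace angle brackets with spaces are assumptions made for illustration.

```python
import json


def clean(text):
    """Replace angle brackets so article text cannot be mistaken for TREC tags."""
    return text.replace("<", " ").replace(">", " ")


def to_trec(article):
    """Render one article dict as a TREC-style document string."""
    parts = [
        "<DOC>",
        "<DOCNO>{}</DOCNO>".format(article["id"]),
        "<TITLE>{}</TITLE>".format(clean(article.get("title", ""))),
        "<TEXT>",
        clean(article.get("content", "")),
        "</TEXT>",
        "</DOC>",
    ]
    return "\n".join(parts) + "\n"


def convert(jsonl_path, trec_path):
    """Convert a JSON Lines file to a single TREC-formatted file."""
    with open(jsonl_path, encoding="utf-8") as source, \
            open(trec_path, "w", encoding="utf-8") as target:
        for line in source:
            if line.strip():
                target.write(to_trec(json.loads(line)))
```

The tags a retrieval platform actually indexes depend on its configuration, so the output of the real script should be used together with the terrier.properties file from this repository.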

Indexing the dataset with Terrier

We recommend using the terrier.properties file included in this repository to index the dataset with Terrier. In your Terrier etc folder, add a text file named "signal.spec" containing a single line: the path to the TREC-formatted file you created above.
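Creating the spec file can be scripted as well. A minimal sketch, assuming you know the location of your Terrier etc folder and of the converted file; write_collection_spec is a hypothetical helper name, and the paths in the usage example are placeholders.

```python
import os


def write_collection_spec(etc_dir, trec_file, spec_name="signal.spec"):
    """Write a one-line spec file containing the absolute path of the TREC file."""
    spec_path = os.path.join(etc_dir, spec_name)
    with open(spec_path, "w", encoding="utf-8") as handle:
        handle.write(os.path.abspath(trec_file) + "\n")
    return spec_path


if __name__ == "__main__":
    # Placeholder paths; replace them with your own locations.
    print(write_collection_spec("terrier/etc", "signalmedia-1m.trec"))
```

An absolute path is written so that the spec file keeps working regardless of the directory from which Terrier is started.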
