A labeller for news articles trained on the NYT annotated corpus by Jasmin Rubinovitz as part of the MIT Media Lab SuperGlue project. Give it the clean text of a story (i.e. no html content), and it returns various descriptors and taxonomic classifiers based on models trained on the tagging in the NYT corpus.
Note - we have not formally assessed these models for embedded bias. They surely contain many, because they are based on the Google News word2vec model and historical New York Times tagging. Be aware as you use results that they likely reflect historical American cultural biases in news reporting.
We use it in the Media Cloud project to automatically tag all news stories with the themes we think they are about.
The quickest path to running this is to fetch the latest release from DockerHub:
docker pull rahulbot/nyt-news-labeler:latest
docker run -p 8000:8000 -m 8G -d rahulbot/nyt-news-labeler:latest
Then just visit http://localhost:8000/ to test it out.
This is an old set of code, so it can be hard to install and run locally. In particular, the serialized models are tied to a specific version of TensorFlow, which adds to the installation difficulty.
- Install Python 3.7 (see `scripts/setup-mac-os.sh` for tips)
- Install python requirements: `pip install -r requirements.txt`
- Install brotli (on MacOS): `brew install brotli`
- Download the models: `python download_models.py` (this will take 10+ minutes, depending on your internet speed)
- Run `./run.sh`. Note: this consumes about 8 GB of memory while running, to keep all the models loaded.
This exposes a simple web UI to make testing easier. Visit `localhost:8000/` to try it out. You can paste in any raw text and click "Get Labels". In a second you will see the top 30 labels from each model below the input.
For batch processing this exposes a simple API. You can make a request like this:
curl -X POST http://localhost:8000/predict.json -H "Content-Type: application/json" -d '{"text": "Federal agents show stronger force at Portland protests despite order to withdraw" }'
You will get back results like this:
{
"milliseconds":77.39500000000001,
"predictions":{
"allDescriptors":[
{
"label":"demonstrations and riots",
"score":"0.28221"
},
{
"label":"politics and government",
"score":"0.03751"
},
...
],
"descriptors3000":[
{
"label":"company reports",
"score":"0.74512"
},
{
"label":"demonstrations and riots",
"score":"0.64673"
},
...
],
"descriptors600":[
{
"label":"demonstrations and riots",
"score":"0.65299"
},
{
"label":"politics and government",
"score":"0.09620"
},
...
],
"descriptorsAndTaxonomies":[
{
"label":"demonstrations and riots",
"score":"0.43143"
},
{
"label":"top/news",
"score":"0.27492"
},
...
],
"taxonomies":[
{
"label":"Top/Features/Travel/Guides/Destinations/North America/United States/Oregon",
"score":"0.35107"
},
{
"label":"Top/News",
"score":"0.18331"
},
...
]
},
"status":"OK",
"version":"1.1.0"
}
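As a sketch of how you might call the API from a script, here is a minimal Python client using only the standard library. The endpoint and JSON shape are taken from the example above; the `predict` and `top_labels` helper names are hypothetical, not part of this project:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict.json"  # the local server started above

def predict(text, url=API_URL):
    """POST story text to the labeller and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def top_labels(response, model="allDescriptors", n=2):
    """Rank one model's predictions by score (scores arrive as strings)."""
    preds = response["predictions"][model]
    ranked = sorted(preds, key=lambda p: float(p["score"]), reverse=True)
    return [(p["label"], float(p["score"])) for p in ranked[:n]]

# Parsing demonstrated on a trimmed copy of the documented response:
sample = {
    "status": "OK",
    "predictions": {
        "allDescriptors": [
            {"label": "demonstrations and riots", "score": "0.28221"},
            {"label": "politics and government", "score": "0.03751"},
        ]
    },
}
print(top_labels(sample))
# [('demonstrations and riots', 0.28221), ('politics and government', 0.03751)]
```

Note that scores come back as strings, so convert them with `float()` before comparing or sorting.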
When you create a new release, be sure to increment the `VERSION` constant in `app.py`. Then tag the repo with the same number.
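The tagging step could look like the sketch below, using an annotated git tag that matches the `VERSION` constant (the 1.1.0 number is the sample version from the response above; the scratch repo here is just so the commands are self-contained):

```shell
# In your own clone you would skip straight to the `git tag` line.
repo="$(mktemp -d)"
cd "$repo"
git init -q .
git config user.email dev@example.com
git config user.name dev
git commit --allow-empty -q -m "release prep"

# Tag the current commit with the same number as VERSION in app.py
git tag -a 1.1.0 -m "Release 1.1.0"
git tag   # lists the new tag
```

Push the tag with `git push origin 1.1.0` so the tagged version is visible on the remote.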
I build and release this to DockerHub for easier deployment on your server. To release the latest code I run:
docker build -t rahulbot/nyt-news-labeler .
docker push rahulbot/nyt-news-labeler
To release a tagged version, I run something like this:
docker build -t rahulbot/nyt-news-labeler:1.1.0 .
docker push rahulbot/nyt-news-labeler:1.1.0
To run a container I've built locally I do:
docker run -p 8000:8000 -m 8G rahulbot/nyt-news-labeler