News ML

Introduction

This application implements a REST API for news management. The API provides the following functionality:

  • Collect news information from News API and other news sources.
  • Search news by keyword or date.
  • Extract entities from news content via Google Cloud NLP and perform sentiment analysis (see the sketch after this list).
  • Send email reports.
  • Use Twitter to tweet news articles.
  • Generate news content based on news headlines.
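
The entity extraction and sentiment analysis step uses Google Cloud NLP. Below is a minimal sketch of how such a call might look with the google-cloud-language client; the function name, sample text, and return values are illustrative and not taken from the news_ml codebase.

# Illustrative sketch only: extract entities and document sentiment from
# article text with the google-cloud-language client
# (pip3 install google-cloud-language). Names here are hypothetical.
from google.cloud import language_v1

def analyze_article(text):
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    entities = client.analyze_entities(request={"document": document}).entities
    sentiment = client.analyze_sentiment(
        request={"document": document}
    ).document_sentiment
    return [entity.name for entity in entities], sentiment.score

names, score = analyze_article("Google releases a new TensorFlow version.")
print(names, score)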

Quickstart

You can use this sample Python script, which collects news articles and stores them in a CSV file using a News API key.
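
A minimal sketch of what such a script might look like (assuming the requests library; the query parameters and output filename are only examples):

# Sketch: fetch top headlines from newsapi.org and store them in a CSV file.
# Requires the NEWS_API_KEY environment variable (see Environment variables below).
import csv
import os
import requests

response = requests.get(
    "https://newsapi.org/v2/top-headlines",
    params={
        "country": "us",
        "category": "technology",
        "apiKey": os.environ["NEWS_API_KEY"],
    },
)
articles = response.json().get("articles", [])

with open("news.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["source", "title", "url", "publishedAt"])
    for article in articles:
        writer.writerow([
            article["source"]["name"],
            article["title"],
            article["url"],
            article["publishedAt"],
        ])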

Architecture

I created an API that handles requests to collect news. The API stack is composed of the following modules:

[Architecture diagram]

Docker containers installation

A set of containers is available to help you deploy the application faster. If you prefer to set up the application manually, go to the Full installation section.

Take a look at the Docker configuration for more information about how to run this application using containers.

Full installation

To perform a manual installation, follow these steps:

Software requirements

Python-based API:

  • Flask
  • Gunicorn
  • Celery
  • RabbitMQ
  • PostgreSQL
  • Nginx
  • Google Cloud NLP

System Design

[System design diagram]

Requirements

Deploy a new Compute Engine instance using Ubuntu 16+.

Install the following software:

sudo apt-get install python build-essential  -y
sudo apt-get install libpq-dev python-dev -y   # Required for psycopg2
sudo apt-get install git -y
sudo apt-get install python3-pip -y
sudo apt-get install rabbitmq-server -y

Clone GitHub repo

cd /usr/local/src
git clone https://github.com/gogasca/news_ml.git
cd news_ml

Install dependencies

pip3 install -r requirements.txt

Install NLTK dependencies

Single line command:

python3 -c 'import nltk; nltk.download("stopwords"); nltk.download("punkt")'

Python Terminal:

python3
>>> import nltk
>>> nltk.download("stopwords")
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
True
>>> nltk.download("punkt")
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
True

Database information

You need to create a new database on a PostgreSQL server:

Using Google Cloud SQL proxy

./cloud_sql_proxy -instances=<Project>:<Region>:<Instance name>=tcp:5432

Using the Google Cloud SQL proxy Docker container

docker run -d \
  -v <PATH_TO_KEY_FILE>:/config \
  -p 127.0.0.1:5432:5432 \
  gcr.io/cloudsql-docker/gce-proxy:1.16 /cloud_sql_proxy \
  -instances=<INSTANCE_CONNECTION_NAME>=tcp:0.0.0.0:5432 -credential_file=/config
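
Once the proxy is listening on 127.0.0.1:5432, a quick connectivity check with psycopg2 (installed via requirements.txt) might look like the sketch below; the credentials come from the environment variables described later in this document.

# Sketch: verify PostgreSQL is reachable through the Cloud SQL proxy.
# DBHOST/DBPORT/DBUSERNAME/DBPASSWORD/DBNAME are defined in the
# "Environment variables" section below.
import os
import psycopg2

connection = psycopg2.connect(
    host=os.environ.get("DBHOST", "127.0.0.1"),
    port=os.environ.get("DBPORT", "5432"),
    user=os.environ.get("DBUSERNAME", "postgres"),
    password=os.environ.get("DBPASSWORD", "postgres"),
    dbname=os.environ.get("DBNAME", "newsml"),
)
with connection.cursor() as cursor:
    cursor.execute("SELECT version();")
    print(cursor.fetchone()[0])
connection.close()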

RabbitMQ

RabbitMQ is an open-source message broker. It is used to handle asynchronous requests. Start the RabbitMQ server:

/usr/local/sbin/rabbitmq-server
rabbitmqctl add_user news_ml news_ml
rabbitmqctl set_user_tags news_ml administrator
rabbitmqctl set_permissions -p / news_ml ".*" ".*" ".*"

Celery

Start Celery and verify RabbitMQ tasks are successful.

export RABBITMQ_USER=news_ml
export RABBITMQ_PASSWORD=news_ml
export RABBITMQ_HOSTNAME=rabbitmq
export RABBITMQ_PORT=5672

cd /usr/local/src/news_ml/conf/
celery worker -n 1 -P processes -c 15 --loglevel=DEBUG -Ofair
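
The worker connects to RabbitMQ through an AMQP broker URL built from the variables above. A minimal sketch of such a Celery application follows; the module name and task are illustrative, not the project's actual ones.

# Sketch: a Celery app pointed at the RabbitMQ broker configured above.
# The module name ("tasks") and the task body are placeholders.
import os
from celery import Celery

broker_url = "amqp://{user}:{password}@{host}:{port}//".format(
    user=os.environ.get("RABBITMQ_USER", "news_ml"),
    password=os.environ.get("RABBITMQ_PASSWORD", "news_ml"),
    host=os.environ.get("RABBITMQ_HOSTNAME", "rabbitmq"),
    port=os.environ.get("RABBITMQ_PORT", "5672"),
)
app = Celery("tasks", broker=broker_url)

@app.task
def collect_news(provider):
    # Placeholder task; the real tasks live in the news_ml codebase.
    return "collected news from %s" % provider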

Environment variables

Update these parameters accordingly in the following files:

vim ~/.bashrc
vim ~/.profile

Change the following variables based on your settings:

export NEWSML_ENV="/usr/local/src/news_ml/"

Configure Database parameters:

export DBHOST=127.0.0.1
export DBPORT=5432
export DBUSERNAME="postgres"
export DBPASSWORD="postgres"
export DBNAME="newsml"

# NEWS API
export NEWS_API_KEY=""  # Change this www.newsapi.org

# System API information. 
export API_USERNAME="AC64861838b417b555d1c8868705e4453f" 
export API_PASSWORD="YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o" 

# Key for Email support from mailgun.com
export MAILGUN_API_KEY="key-"  # Change this
export MAILGUN_DOMAIN=""

# Key used for encrypting user information.
export SECRET_FERNET_KEY=""  # Change this

Note: SECRET_FERNET_KEY can be generated as follows:

>>> from cryptography.fernet import Fernet
>>> key = Fernet.generate_key()
>>> f = Fernet(key)
>>> token = f.encrypt(b"YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o")
>>> token

Use the resulting token value.
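
For reference, the same key decrypts the token back to the original value; this only illustrates the Fernet mechanics, while how the application consumes the key and token is defined in its settings.

>>> f.decrypt(token)
b'YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o'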

Database schema generation

To generate the schema, use sqlacodegen (see https://pydigger.com/pypi/sqlacodegen) with a connection string of the form:

postgresql://username:password@hostname/database

Configure API settings

Depending on the path where you cloned the repo, you may need to edit the settings file.

Define NEWSML_ENV:

if platform.system() == 'Linux':
    filepath = '/usr/local/src/news_ml/'
else:
    filepath = '/Users/user/Documents/Development/news/'

Configure Gunicorn parameters:

export GUNICORN_LOGFILE=/tmp/gunicorn.log
export API_PORT=8081
export NUM_WORKERS=1
export TIMEOUT=60
export WORKER_CONNECTIONS=1000
export BACKLOG=500
export LOG_LEVEL=DEBUG

Start API server

cd /usr/local/src/news_ml/api/version1_0
gunicorn news_ml:api_app --bind 0.0.0.0:$API_PORT --log-level=$LOG_LEVEL --log-file=$GUNICORN_LOGFILE --workers $NUM_WORKERS --worker-connections=$WORKER_CONNECTIONS --backlog=$BACKLOG --timeout $TIMEOUT &

Supported API endpoints

/api/1.0/status
/api/1.0/campaign
/api/1.0/clustering
/api/1.0/person
/api/1.0/news
/api/1.0/rank

Examples:

  • Check API status

Local Authentication:

curl -u AC64861838b417b555d1c8868705e4453f:YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o -H "Content-Type: application/json" http://0.0.0.0:8081/api/1.0/

Database authentication:

curl -u AC64861838b417b555d1c8868705e4453f:YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o http://0.0.0.0:8081/api/1.0/status

  • Request News from NEWS API:

curl -u AC64861838b417b555d1c8868705e4453f:YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o -H "Content-Type: application/json" -X POST -d '{ "provider": "news_api"}' http://0.0.0.0:8081/api/1.0/campaign

  • Request News from NEWS API with an email report:

curl -u AC64861838b417b555d1c8868705e4453f:YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o -H "Content-Type: application/json" -X POST -d '{ "provider": "news_api", "report": {"email": "[email protected]"}}' http://0.0.0.0:8081/api/1.0/campaign

  • Request News from NEWS API and tweet the articles:

curl -u AC64861838b417b555d1c8868705e4453f:YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o -H "Content-Type: application/json" -X POST -d '{ "provider": "news_api", "report": {"twitter": {"delay": 1, "add_hashtags": true}}}' http://0.0.0.0:8081/api/1.0/campaign

  • Search for news mentioning 'tensorflow, sagemaker and keras' via NEWS API:

curl -u AC64861838b417b555d1c8868705e4453f:YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o -H "Content-Type: application/json" -X POST -d '{ "provider": "news_api", "query": "tensorflow, sagemaker, keras"}' http://0.0.0.0:8081/api/1.0/campaign

  • Request News from TechMeme:

curl -u AC64861838b417b555d1c8868705e4453f:YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o -H "Content-Type: application/json" -X POST -d '{ "provider": "techmeme"}' http://0.0.0.0:8081/api/1.0/campaign

  • Read existing news:

curl -u AC64861838b417b555d1c8868705e4453f:YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o -H "Content-Type: application/json" http://0.0.0.0:8081/api/1.0/news

  • Read news from amazon.com:

curl -u AC64861838b417b555d1c8868705e4453f:YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o -H "Content-Type: application/json" "http://0.0.0.0:8081/api/1.0/news?source=amazon.com"

Note: If you are using zsh, add -w '\n' to the curl command to avoid a % character at the end of the response.
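
The same endpoints can be called from Python with the requests library; here is a minimal sketch for the status endpoint, using the example credentials from the curl commands above:

# Sketch: call the status endpoint with HTTP Basic Authentication.
# The credentials below are the example values used throughout this README.
import requests

API_USERNAME = "AC64861838b417b555d1c8868705e4453f"
API_PASSWORD = "YYPKpbIAYqz90oMN8A11YYPKpbIAYqz90o"

response = requests.get(
    "http://0.0.0.0:8081/api/1.0/status",
    auth=(API_USERNAME, API_PASSWORD),
    headers={"Content-Type": "application/json"},
)
print(response.status_code, response.text)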

API performance

Use Apache ab tool to measure API performance.

ab -n 10 -A username:password https://<API_HOSTNAME>/api/1.0/status

Create API Users

curl -H "Content-Type: application/json" 
     -X POST -d '{"username":"[email protected]", "password":"54321"}' 
     http://0.0.0.0:8080/api/1.0/users

The api_users table is defined as follows:

CREATE TABLE api_users
(
    id              serial primary key,
    username        VARCHAR(256) unique not null,
    password_hash   VARCHAR(256) not null,
    created         timestamp(6) WITH TIME ZONE
);

Token authentication

Requests are exposed to man-in-the-middle attacks. With HTTP Basic Authentication, user credentials are sent along with every request and can be intercepted. We therefore use token-based authentication: users request a token and then use it in subsequent requests to access data. The token lives for a short span of time, so even if an attacker manages to obtain it, it is only valid briefly. This adds one more layer of security to the REST API.
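
A minimal sketch of the short-lived-token idea using itsdangerous is shown below; it illustrates the concept and is not the repository's actual implementation. The secret key, sample username, and 600-second lifetime are placeholder values.

# Sketch of short-lived token issuance and verification with itsdangerous.
from itsdangerous import URLSafeTimedSerializer, SignatureExpired, BadSignature

SECRET_KEY = "change-me"  # Placeholder; use a real secret in practice.
serializer = URLSafeTimedSerializer(SECRET_KEY)

def issue_token(username):
    # Sign the username; the signature embeds a timestamp.
    return serializer.dumps({"username": username})

def verify_token(token, max_age=600):
    # Reject tokens that are tampered with or older than max_age seconds.
    try:
        return serializer.loads(token, max_age=max_age)
    except (SignatureExpired, BadSignature):
        return None

token = issue_token("[email protected]")
print(verify_token(token))        # {'username': '[email protected]'}
print(verify_token(token + "x"))  # None (tampered token)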

Manage NewsML Services via supervisor

pip3 install supervisor

Add the following environment variables:

export NEWSML_ENV="/usr/local/src/news_ml/"
export C_FORCE_ROOT="true"

to the following files:

vim ~/.bashrc
vim ~/.profile

Create the supervisor directories and copy the configuration files:

mkdir -p /etc/supervisor/conf.d
mkdir /var/log/supervisor/

cp /usr/local/bin/supervisorctl /usr/bin/
cp /usr/local/bin/supervisord /usr/bin/
cp /usr/local/src/news_ml/conf/supervisor/celeryd.conf /etc/supervisor/conf.d
cp /usr/local/src/news_ml/conf/supervisor/supervisord.conf /etc/supervisor/
    

Start supervisor after reboot:

supervisord -c /etc/supervisor/supervisord.conf

Use supervisorctl to check services status.

Upgrades

cd /usr/local/src/news_ml
git pull

supervisorctl restart all
supervisorctl status

Load balancer (Optional)

Nginx acts as our API load balancer.

    apt-get update
    locale-gen en_US en_US.UTF-8
    apt-get install -y nano vim wget dialog net-tools
    apt-get install -y nginx nginx-common nginx-extras
    vim /etc/nginx/sites-available/default
    vim nginx.conf 

Ranking algorithm

    def rank(self):
        """Assign score based on source.
        Sources defined in settings file.

        :return:
        """
    
        try:
            self._ranking_source = settings.RANKING_SOURCES.index(self._source)
        except ValueError:
            self._ranking_source = settings.UNKNOWN_SOURCE_SCORE

        try:
            self._ranking_provider = settings.RANKING_PROVIDERS.index(self._provider)
        except ValueError:
            self._ranking_provider = settings.UNKNOWN_PROVIDER_SCORE

        # Articles which are read first are prioritized. Dividing a fixed
        # weight by the 1-based ranking position gives higher-priority
        # (lower-index) sources and providers a larger score and avoids a
        # division by zero for the top entry.
        self.score += 20 // (self._ranking_source + 1)
        self.score += 30 // (self._ranking_provider + 1)
        self.score += self.order + random.randrange(0, 10)
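
As a concrete illustration with hypothetical settings (not the project's real values), the fixed weights behave as follows:

# Hypothetical settings, for illustration only.
RANKING_SOURCES = ['techmeme', 'amazon.com']
RANKING_PROVIDERS = ['news_api']

# Article from 'techmeme' (index 0) delivered by 'news_api' (index 0):
top_score = 20 // (0 + 1) + 30 // (0 + 1)      # 20 + 30 = 50
# Article from 'amazon.com' (index 1) delivered by 'news_api' (index 0):
second_score = 20 // (1 + 1) + 30 // (0 + 1)   # 10 + 30 = 40
print(top_score, second_score)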
        

Cronjob

You can use a cronjob to generate a report every 6 hours. Example:

crontab -e
0 */6 * * * /usr/local/src/news_ml/utils/scripts/get_news.sh 

Troubleshooting

Problem: Supervisor not starting services.

Solution: Validate that the .sh scripts have execute permissions.

--

Problem: Supervisor not starting API service.

Solution: Run Gunicorn manually to check for errors: gunicorn --log-file=- news_ml:api_app

--

Problem: Supervisor not starting Celery service.

Solution: Validate RabbitMQ is started

--

Problem: Can't start services.

Solution: Install the NLTK dependencies as root.

--

Problem: Can't start services.

Solution: Run sudo -i and verify .bashrc and .profile.

Questions?

Bugs and issues can be reported at support [at] newsml [dot] io
