In this project, we need to collect information about articles belonging to different categories (such as airports, artists, politicians, sportspeople, etc.). Based on this information, we will try to automatically cluster and classify these articles into the correct categories. The sources of information we are using are:
- Wikipedia Online Encyclopedia
- Wikidata knowledge base
Note: The following code has been implemented in Python 3.
- cd to the directory where requirements.txt is located;
- run: pip install -r requirements.txt
In order to execute the code from this repository and get the results, the following commands need to be run:
python data_extraction.py --num_articles 100 --num_sentences 5
python preprocessing_nltk.py --input data/extracted_data.csv --output data/preprocessed_data.csv
python clustering.py --input data/preprocessed_data.csv --output data/
python category_classification.py --inputpath data/preprocessed_data.csv
The project consists of four steps, namely:
- Corpus Extraction
- Pre-Processing
- Clustering
- Classifying
The Corpus Extraction step extracts information about articles from Wikipedia and Wikidata. These articles belong to the following categories:
- Airports
- Artists
- Astronauts
- Building
- Astronomical_objects
- City
- Comics_characters
- Companies
- Foods
- Transport
- Monuments_and_memorials
- Politicians
- Sports_teams
- Sportspeople
- Universities_and_colleges
- Written_communication
The Corpus Extraction script gets k articles (the num_articles param) from each category and stores the following features: description, page content (where the number n of sentences is set by the num_sentences param), infobox, and Wikidata statements (triples). The result is stored in data/extracted_data.csv.
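For illustration, here is a minimal sketch of how such an extraction could look, using the public MediaWiki API and the wikipedia Python package; the function names (get_category_members, get_article_summary) are illustrative, and the actual data_extraction.py may work differently.

```python
# Illustrative sketch only: fetch article titles of a category via the MediaWiki API,
# then pull a short summary for each title with the `wikipedia` package.
import requests
import wikipedia  # pip install wikipedia

WIKI_API = "https://en.wikipedia.org/w/api.php"

def get_category_members(category, limit=100):
    """Return up to `limit` article titles from a Wikipedia category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "page",
        "cmlimit": limit,
        "format": "json",
    }
    response = requests.get(WIKI_API, params=params)
    return [m["title"] for m in response.json()["query"]["categorymembers"]]

def get_article_summary(title, num_sentences=5):
    """Return the first `num_sentences` sentences of the article's summary."""
    return wikipedia.summary(title, sentences=num_sentences)

if __name__ == "__main__":
    for title in get_category_members("Airports", limit=5):
        print(title, "->", get_article_summary(title)[:80], "...")
```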
The Pre-Processing step takes as input the CSV file (data/extracted_data.csv) generated by the previous step. The data is processed using the following steps:
- Tokenize the text
- Lowercase the tokens
- Remove punctuation and function words
- Remove stop words
The preprocessed output is stored in data/preprocessed_data.csv.
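As an illustration, here is a minimal sketch of these pre-processing steps with NLTK (the library suggested by the name preprocessing_nltk.py); the exact tokenizer and stop word list used by the actual script may differ.

```python
# Illustrative sketch of the four pre-processing steps using NLTK.
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer models
nltk.download("punkt_tab", quiet=True)   # needed by newer NLTK versions
nltk.download("stopwords", quiet=True)   # stop word lists

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    tokens = word_tokenize(text)                                   # tokenize the text
    tokens = [token.lower() for token in tokens]                   # lowercase the tokens
    tokens = [t for t in tokens if t not in string.punctuation]    # remove punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]            # remove stop words
    return tokens

print(preprocess("The airport, located in Berlin, opened in 1923."))
# ['airport', 'located', 'berlin', 'opened', '1923']
```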
The Clustering step takes data/preprocessed_data.csv as input (the --input param), which is used to train the KMeans algorithm. The method used to process the data is:
- Tf-Idf
The results obtained are stored in data/Clustering results.csv (which shows a comparison table) and data/Clustering visualization.png (which shows a comparison plot).
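A minimal sketch of this step with scikit-learn is shown below, assuming the preprocessed CSV has a text column; the actual column names and evaluation metrics computed by clustering.py may differ.

```python
# Illustrative sketch: Tf-Idf features + KMeans for a range of cluster counts.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("data/preprocessed_data.csv")
texts = df["text"].astype(str)                 # assumed column name; adjust to the real schema
X = TfidfVectorizer().fit_transform(texts)     # Tf-Idf document-term matrix

# Fit KMeans for 2 to 16 clusters, the range compared in the results table below.
cluster_labels = {}
for k in range(2, 17):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    cluster_labels[k] = km.fit_predict(X)

print({k: len(set(labels)) for k, labels in cluster_labels.items()})
```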
The Classifying step takes data/preprocessed_data.csv as input, which is used to train the Perceptron algorithm. The data is vectorised using a Tf-Idf Vectorizer. The results obtained are stored in the data directory. The files obtained are:
- Confusion matrix 16 classes.png with the confusion matrix for the 16-class classification
- Scores 16 classes.csv with the Precision, Recall, F1-Score and Support scores for the 16-class classification
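A minimal sketch of this step with scikit-learn is shown below, assuming text and category columns in the preprocessed CSV and a standard train/test split; the actual setup in category_classification.py may differ.

```python
# Illustrative sketch: Tf-Idf features + Perceptron classifier, with per-class scores
# and a confusion matrix like the ones saved in the data directory.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/preprocessed_data.csv")
X = TfidfVectorizer().fit_transform(df["text"].astype(str))   # assumed column name
y = df["category"]                                            # assumed column name

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = Perceptron(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))   # precision / recall / F1 / support per class
print(confusion_matrix(y_test, y_pred))        # rows = true labels, columns = predicted labels
```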
The results below were obtained on the data stored in the data directory:
Clustering:
16 clusters | 15 clusters | 14 clusters | 13 clusters | 12 clusters | 11 clusters | 10 clusters | 9 clusters | 8 clusters | 7 clusters | 6 clusters | 5 clusters | 4 clusters | 3 clusters | 2 clusters |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.6681 | 0.6983 | 0.6697 | 0.5906 | 0.6203 | 0.5870 | 0.5574 | 0.5437 | 0.4830 | 0.4572 | 0.3892 | 0.3289 | 0.2632 | 0.1797 | 0.1077 |
0.7059 | 0.7255 | 0.7543 | 0.7083 | 0.7348 | 0.7514 | 0.7561 | 0.7345 | 0.7404 | 0.7573 | 0.7177 | 0.6973 | 0.7070 | 0.6561 | 0.6751 |
0.6865 | 0.7117 | 0.7095 | 0.6441 | 0.6727 | 0.6591 | 0.6417 | 0.6249 | 0.5846 | 0.5702 | 0.5047 | 0.4470 | 0.3836 | 0.2821 | 0.1857 |
0.4300 | 0.5637 | 0.4390 | 0.3497 | 0.4168 | 0.3411 | 0.3109 | 0.3337 | 0.2398 | 0.2193 | 0.1708 | 0.1312 | 0.0879 | 0.0529 | 0.0305 |
0.0980 | 0.0972 | 0.0943 | 0.0937 | 0.0907 | 0.0862 | 0.0834 | 0.0826 | 0.0811 | 0.0801 | 0.0732 | 0.0662 | 0.0595 | 0.0517 | 0.0392 |
Classification:
16-class classification:
category | precision | recall | f1-score | support |
---|---|---|---|---|
Airports | 1.0 | 0.75 | 0.86 | 28.0 |
Artists | 0.86 | 0.86 | 0.86 | 14.0 |
Astronauts | 0.77 | 1.0 | 0.87 | 17.0 |
Astronomical_objects | 1.0 | 1.0 | 1.0 | 9.0 |
Building | 0.73 | 0.73 | 0.73 | 11.0 |
City | 1.0 | 0.95 | 0.97 | 20.0 |
Comics_characters | 0.81 | 0.93 | 0.87 | 14.0 |
Companies | 0.83 | 0.71 | 0.77 | 7.0 |
Foods | 0.87 | 1.0 | 0.93 | 13.0 |
Monuments_and_memorials | 1.0 | 0.89 | 0.94 | 9.0 |
Politicians | 0.86 | 0.67 | 0.75 | 9.0 |
Sports_teams | 0.71 | 0.59 | 0.64 | 29.0 |
Sportspeople | 0.38 | 0.44 | 0.41 | 18.0 |
Transport | 0.75 | 0.75 | 0.75 | 8.0 |
Universities_and_colleges | 0.55 | 0.75 | 0.63 | 8.0 |
Written_communication | 0.67 | 1.0 | 0.8 | 4.0 |
accuracy |  |  | 0.79 | 218.0 |
macro avg | 0.8 | 0.81 | 0.8 | 218.0 |
weighted avg | 0.81 | 0.79 | 0.79 | 218.0 |
We visualised the confusion matrix to identify which categories were misclassified and as what. This explains the low scores for Sportspeople: out of 17 test examples for Sportspeople, 9 were classified as Sports_teams. The vocabulary of Sportspeople articles overlaps heavily with that of Sports_teams articles, and since Sports_teams had more training samples, the classifier favoured that label.
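This kind of inspection can be reproduced from any confusion matrix by locating its largest off-diagonal cell; the sketch below uses illustrative numbers, not the actual results of this project.

```python
# Illustrative sketch: find the most frequent misclassification in a confusion matrix.
import numpy as np

# Illustrative labels and counts (rows = true class, columns = predicted class);
# these are NOT the actual confusion matrix values from this project.
labels = ["Sports_teams", "Sportspeople"]
cm = np.array([[20, 3],
               [9,  8]])

off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)
true_idx, pred_idx = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)
print(f"{labels[true_idx]} is most often misclassified as {labels[pred_idx]} "
      f"({off_diagonal[true_idx, pred_idx]} test examples)")
```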
- Sahil Chopra
- Jorge Vasquez
- Colm Rooney