soomroha/Multi-Class-Text-Classification-Analysis

Multi-Class-Text-Classification-Analysis

Project: Compare the performance of four algorithms (Logistic Regression, Random Forest, Naive Bayes, and Linear SVC) that classify news headlines into 4 classes.

• Input: TITLE
  o Example: "Bitcoin exchange seeks U.S. bankruptcy protection"

• Output of classification algorithm: CATEGORY
  o Example: Business

Loading and cleaning the dataset

import string

import pandas as pd

# Load the headlines and keep only the label and text columns.
df = pd.read_csv('headlines.csv')
df = df[['CATEGORY', 'TITLE']]
# Drop rows with a missing headline.
df = df[pd.notnull(df['TITLE'])].copy()
# Lowercase each title, then strip punctuation and digits.
df['TITLE'] = df['TITLE'].apply(lambda x: x.lower())
df['TITLE'] = df['TITLE'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation + string.digits)))
# Map each category to an integer id.
df['category_id'] = df['CATEGORY'].factorize()[0]
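
To see what the cleaning step does to a single headline, here is a minimal sketch that applies the same lowercase, punctuation, and digit rules to one string. The helper name `clean_title` is hypothetical and is not part of the repository.

```python
import string

# Translation table that deletes ASCII punctuation and the ten digits.
_DROP = str.maketrans('', '', string.punctuation + string.digits)


def clean_title(title: str) -> str:
    """Lowercase a headline, then remove punctuation and digits."""
    return title.lower().translate(_DROP)


example = "Bitcoin exchange seeks U.S. bankruptcy protection"
cleaned = clean_title(example)
print(cleaned)  # bitcoin exchange seeks us bankruptcy protection
```

Note that characters are deleted rather than replaced by spaces, so "U.S." becomes "us" and the whitespace around a removed number is not collapsed.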

Top 5 Features by Category

[Figure: top 5 features by category]

Mean Accuracy and Standard Deviation of the Algorithms

Logistic Regression: Mean Accuracy: 0.847214, Standard Deviation: 0.046154

Random Forest: Mean Accuracy: 0.361110, Standard Deviation: 0.098128

Naive Bayes: Mean Accuracy: 0.855489, Standard Deviation: 0.038743

Linear SVC: Mean Accuracy: 0.849241, Standard Deviation: 0.045773
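
Figures like these are typically obtained by scoring each algorithm on several cross-validation folds and summarizing the per-fold accuracies. The sketch below shows only the summary step. The fold values are hypothetical, and using the population standard deviation is an assumption, since the README does not say which variant was used.

```python
from statistics import fmean, pstdev


def summarize(fold_accuracies):
    """Return (mean, population standard deviation) of per-fold accuracies."""
    return fmean(fold_accuracies), pstdev(fold_accuracies)


# Hypothetical per-fold accuracies, for illustration only (not project results).
fold_accuracies = [0.80, 0.90, 0.85, 0.90, 0.80]
mean_acc, std_acc = summarize(fold_accuracies)
print(f"Mean Accuracy: {mean_acc:.6f}, Standard Deviation: {std_acc:.6f}")
```

Swapping `pstdev` for `statistics.stdev` gives the sample standard deviation instead, which is slightly larger for a small number of folds.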

[Figure: mean accuracy and standard deviation of the algorithms]

Confusion Matrices of the Algorithms

[Figures: four confusion matrices]
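
For readers unfamiliar with confusion matrices, the following is a minimal pure-Python sketch of how one is tallied from true and predicted labels. Rows are true classes and columns are predicted classes, which is an assumed layout. The label "Technology" and all sample predictions are hypothetical, and only two of the four classes are shown to keep the example short.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns predicted."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Hypothetical labels and predictions, for illustration only.
labels = ["Business", "Technology"]
y_true = ["Business", "Business", "Technology", "Technology", "Technology"]
y_pred = ["Business", "Technology", "Technology", "Technology", "Business"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)

# Accuracy is the share of predictions on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
print(accuracy)
```

Off-diagonal cells show which classes are confused with each other, which a single accuracy number cannot reveal.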

References:

https://towardsdatascience.com/a-production-ready-multi-class-text-classifier-96490408757
https://buhrmann.github.io/tfidf-analysis.html
