Welcome to the DeepTweets Classification project! This project was developed as part of a competition involving binary tweet classification: each tweet must be assigned to one of two categories, Politics or Sports. The primary evaluation metric is classification accuracy. This README provides an overview of the project, its functionality, and the code used to achieve this classification.
...
- Pandas: Data manipulation and analysis library.
- NumPy: Scientific computing library for numerical operations.
- Matplotlib: Visualization library for creating plots and charts.
- Seaborn: Data visualization library based on Matplotlib.
- NLTK: Natural Language Toolkit for text processing and analysis.
- Scikit-learn: Machine learning library for training and evaluating models.
Make sure to install these dependencies before running the code. You can install them with:

```
pip install pandas numpy matplotlib seaborn nltk scikit-learn
```
The dataset used for this project consists of labeled tweets. The training dataset is loaded from `deeptweets/train.csv`, and the testing dataset is loaded from `deeptweets/test.csv`. The dataset includes the following columns:

- `TweetId`: The unique identifier of the tweet.
- `TweetText`: The text of the tweet.
- `Label`: The label indicating whether the tweet belongs to Politics or Sports.
| TweetId | Label | TweetText |
|---|---|---|
| 304271250237304833 | Politics | #SecKerry: The value of the @StateDept and @USAID is measured, not in dollars, but in terms of our deepest American values. |
| 304834304222064640 | Politics | @rraina1481 I fear so |
| 303568995880144898 | Sports | Watch video highlights of the #wwc13 final between Australia and West Indies at http://t.co/lBXIIk3j |
| 304366580664528896 | Sports | RT @chelscanlan: At Nitro Circus at #AlbertPark ! #theymakeitlooksoeasy @ausgrandprix @ChadwickModels #CantWaitForAusGP |
| 296770931098009601 | Sports | @cricketfox Always a good thing. Thanks for the feedback :-) |
Data preprocessing is a crucial step in natural language processing tasks. In this project, the following preprocessing steps were performed:

- Label Encoding: The labels were encoded as 0 for Sports and 1 for Politics.
- Text Cleaning: The text data was cleaned using the following steps:
  - Removal of Twitter handles (e.g., @username).
  - Removal of special characters, numbers, and punctuation.
  - Removal of short words (words with fewer than 3 characters).
  - Removal of URLs.
  - Removal of single quotes.
- Text Tokenization and Stemming: Tokenization breaks the text down into individual words (tokens). The tokens were then stemmed using the Porter Stemmer, reducing words to their root forms to aid feature extraction.
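The cleaning steps above can be sketched as a single function. This is a minimal illustration, not the exact competition code; the function name and regular expressions are assumptions, and the stemming step (done with NLTK's `PorterStemmer` in the project) is omitted here to keep the sketch dependency-free:

```python
import re

def clean_tweet(text):
    """Apply the cleaning steps described above to a single tweet."""
    text = re.sub(r"http\S+", "", text)      # remove URLs
    text = re.sub(r"@\w+", "", text)         # remove Twitter handles
    text = text.replace("'", "")             # remove single quotes
    text = re.sub(r"[^a-zA-Z ]", " ", text)  # remove special characters and numbers
    tokens = [w for w in text.split() if len(w) >= 3]  # drop short words
    return " ".join(tokens)

print(clean_tweet("@user check http://t.co/x the #wwc13 final!!"))
# → check the wwc final
```

In the actual pipeline, each token would then be passed through the Porter Stemmer before rejoining.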
Text data needs to be transformed into a format suitable for machine learning algorithms. In this project, the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer was used for feature extraction. The TF-IDF vectorizer converts text data into numerical features that can be used to train machine learning models.
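The feature-extraction step can be sketched as follows. The corpus and the `max_features` value are illustrative assumptions, not the competition settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for cleaned, stemmed tweets
corpus = [
    "seckerry value statedept usaid measur",   # a Politics-style tweet
    "watch video highlight final australia",   # a Sports-style tweet
]

# Fit TF-IDF on the training texts and transform them into a sparse matrix
vectorizer = TfidfVectorizer(max_features=1000)
X = vectorizer.fit_transform(corpus)

print(X.shape)  # (number of tweets, vocabulary size)
```

The same fitted vectorizer must be reused (via `transform`, not `fit_transform`) on the test tweets so that train and test features share one vocabulary.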
A Multinomial Naive Bayes classifier was used for this tweet classification task. The model was trained using the training data after feature extraction. The code for model training and evaluation is provided in the code section.
The performance of the trained model is evaluated using classification accuracy. A confusion matrix is also plotted to visualize the model's performance.
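Training and evaluation can be sketched as below. The toy texts and labels are invented for illustration, and the confusion matrix is printed rather than plotted; the project evaluates on held-out data, not the training set as done here:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy stand-in data; the real project uses the cleaned tweets from train.csv
texts = ["vote elect senat polici", "match score goal win",
         "senat debat vote", "cricket final win match"]
labels = np.array([1, 0, 1, 0])  # 1 = Politics, 0 = Sports

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Train the Multinomial Naive Bayes classifier
model = MultinomialNB()
model.fit(X, labels)

# Evaluate with accuracy and a confusion matrix
pred = model.predict(X)
print("accuracy:", accuracy_score(labels, pred))
print(confusion_matrix(labels, pred))
```

In the project, the confusion matrix would typically be visualized with Matplotlib/Seaborn (e.g., as a heatmap) rather than printed.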
After training the model, it was used to make predictions on the test dataset. The predicted labels were converted to "Sports" or "Politics" based on a threshold, and the results were saved to a CSV file named `predictions.csv`.
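The final mapping and export step can be sketched as follows. The numeric predictions, tweet IDs, and column names here are assumptions for illustration, not the competition output:

```python
import pandas as pd

# Hypothetical numeric predictions from the trained model (0 = Sports, 1 = Politics)
pred = [1, 0, 0, 1]
tweet_ids = [304271250237304833, 303568995880144898,
             296770931098009601, 304834304222064640]

# Map numeric classes back to the string labels used in the dataset
out_labels = ["Politics" if p == 1 else "Sports" for p in pred]

# Save the results in the submission format
submission = pd.DataFrame({"TweetId": tweet_ids, "Label": out_labels})
submission.to_csv("predictions.csv", index=False)
print(submission.head())
```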
Feel free to explore the code and adapt it for your own classification tasks. If you have any questions or need further assistance, please don't hesitate to reach out. Good luck with your tweet classification endeavors!