The goal of "Classifying Violence Text" is to build a machine learning model that assigns text documents to specific classes of violence. By analyzing the text, the project aims to identify the patterns and features that distinguish one form of violence from another, in support of broader efforts to address them.
I used a dataset from Kaggle containing text descriptions of violent incidents, each labeled with one of five classes representing different forms of violence. This dataset is the basis for training, validating, and evaluating the classification model.
Before modeling, I applied the following preprocessing steps to the text data:
- Converting all text to lowercase to maintain uniformity in text representation.
- Removing punctuation so that only the words remain.
- Removing common stop words, which carry little class-specific signal.
- Applying lemmatization to reduce each word to its base (dictionary) form.
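The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the `STOP_WORDS` set and the `LEMMAS` lookup are tiny hypothetical stand-ins for a full stop-word list and a real lemmatizer, and the `preprocess` function name is an assumption.

```python
import string

# Hypothetical, deliberately tiny stop-word list (a real run would use a fuller one).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on", "is", "are", "was", "were"}

# Hypothetical toy lemma lookup standing in for a real lemmatizer.
LEMMAS = {"threatened": "threaten", "threats": "threat", "children": "child"}


def preprocess(text, lemmatize=lambda word: LEMMAS.get(word, word)):
    """Lowercase, strip punctuation, drop stop words, then lemmatize each token."""
    text = text.lower()
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(lemmatize(tok) for tok in tokens)


cleaned = preprocess("The children were THREATENED, repeatedly!")
print(cleaned)
```

Passing the lemmatizer in as a function keeps the sketch independent of any particular NLP library; a library-backed lemmatizer could be swapped in without changing the other steps.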
To convert the preprocessed text into numerical features, I used the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer. TF-IDF weights each term by how often it appears in a document, scaled down by how many documents in the corpus contain it, so words that are common everywhere contribute less than words that are distinctive to a few documents.
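As a rough illustration of the vectorization step, the sketch below runs Scikit-Learn's `TfidfVectorizer` with its default settings on three made-up, already-preprocessed documents. The documents are hypothetical and are not taken from the Kaggle dataset; the settings used in the project are not stated, so the defaults here are an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical preprocessed documents (not from the actual dataset).
docs = [
    "partner hit shout home",
    "threaten message online",
    "partner threaten home",
]

vectorizer = TfidfVectorizer()          # default settings: L2-normalized rows, smoothed IDF
X = vectorizer.fit_transform(docs)      # sparse matrix: one row per document, one column per term

vocab = vectorizer.vocabulary_          # maps each term to its column index
first_doc = X[0].toarray()[0]

# "hit" occurs in one document, "partner" in two, so "hit" gets the larger weight.
print("hit:", first_doc[vocab["hit"]])
print("partner:", first_doc[vocab["partner"]])
```

Both terms appear once in the first document, so the difference in their weights comes entirely from the inverse document frequency part.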
For the classification task I chose Logistic Regression, a simple and efficient algorithm that handles both binary and multi-class problems. The model was trained on the TF-IDF features to assign each text document to its violence class.
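A minimal sketch of this training step, assuming a Scikit-Learn pipeline that chains `TfidfVectorizer` and `LogisticRegression`, is shown below. The texts and the labels `class_a`, `class_b`, and `class_c` are hypothetical placeholders (the real dataset has five classes), and `max_iter=1000` is an assumed setting rather than one reported for the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical preprocessed texts with placeholder class labels.
texts = [
    "partner hit shout home",
    "husband hit push home",
    "threaten message online",
    "threaten call phone message",
    "refuse money control salary",
    "control money bank salary",
]
labels = ["class_a", "class_a", "class_b", "class_b", "class_c", "class_c"]

# Vectorizer and classifier in one pipeline, so raw text goes in and labels come out.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

new_text = ["threaten message phone"]
pred = model.predict(new_text)          # predicted label for each input text
proba = model.predict_proba(new_text)   # one probability per class, in classes_ order
classes = model.named_steps["logisticregression"].classes_

print(pred[0], dict(zip(classes, proba[0].round(3))))
```

Wrapping both steps in a pipeline means the vectorizer is fitted only on the training texts, which avoids leaking vocabulary statistics from held-out data when the same object is later used with a train/test split.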
The project relies on the following Python libraries:
- Pandas: for data manipulation and analysis.
- Matplotlib and Seaborn: for data visualizations and plots.
- Scikit-Learn: for implementing and evaluating the machine learning models.
The trained Logistic Regression model performed well at classifying violence text, with good accuracy and precision across the violence classes. The results offer insight into the characteristics that distinguish each form of violence, which could help in designing prevention strategies and support mechanisms.
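Accuracy and per-class precision of the kind mentioned above can be computed with Scikit-Learn's metrics functions. The sketch below uses small hypothetical label lists purely to show the calls; the numbers it produces are not the project's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Hypothetical true and predicted labels for six held-out documents.
class_names = ["class_a", "class_b", "class_c"]
y_true = ["class_a", "class_a", "class_b", "class_b", "class_c", "class_c"]
y_pred = ["class_a", "class_b", "class_b", "class_b", "class_c", "class_a"]

# Fraction of documents whose predicted label matches the true label.
accuracy = accuracy_score(y_true, y_pred)

# Precision per class: of the documents predicted as a class, how many truly belong to it.
per_class_precision = precision_score(y_true, y_pred, labels=class_names, average=None)

# Rows are true classes, columns are predicted classes, both in class_names order.
matrix = confusion_matrix(y_true, y_pred, labels=class_names)

print(f"accuracy: {accuracy:.3f}")
for name, value in zip(class_names, per_class_precision):
    print(f"precision[{name}]: {value:.3f}")
print(matrix)
```

Reporting precision per class rather than a single average makes it easier to see which forms of violence the model confuses with one another.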
Tools Used: Python, Pandas, Matplotlib, Seaborn, Scikit-Learn