Movie Synopsis Text Clustering using K-Means Clustering and a TF-IDF Vectorizer, deployed with the Flask framework.
This project is the major assignment for the second semester of the natural language processing course. The objective is to collect movie synopsis data, cluster the synopses into k clusters, and turn the result into a web application.
In this project, we use K-Means Clustering to perform text clustering and a TF-IDF Vectorizer to convert the text data into numeric feature vectors.
- Python
- Pandas
- Numpy
- Matplotlib
- Seaborn
- Scikit-learn
- Wordcloud
- Requests
- Flask
- The dataset used in this project is movie synopsis data obtained from The Movie Database (TMDB) API. Only movie titles and their synopses are retrieved; genres are left out because the K-Means model is expected to discover genre-like groups in an unsupervised manner during text clustering.
- Although the genre feature is not included, the collected movies span a wide range of genres. There are 19 genres in total across the collected data:
- Drama
- Crime
- Comedy
- Action
- Thriller
- Documentary
- Adventure
- Science Fiction
- Animation
- Family
- Romance
- Mystery
- Horror
- Fantasy
- War
- Music
- History
- Western
- TV Movie
- The dataset contains 8214 movies, each with its synopsis
- The dataset is stored in data/movie_synopsis.csv
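As a rough illustration of the collection step, the sketch below reduces a TMDB response page to title and synopsis rows using Requests. The endpoint (`/movie/popular`), the `api_key` query parameter, the helper names `fetch_page` and `extract_rows`, and the output key `synopsis` are assumptions made for illustration; this README does not state which TMDB endpoint or authentication style was actually used.

```python
import requests

# Assumed endpoint; the project may have used a different TMDB route.
TMDB_URL = "https://api.themoviedb.org/3/movie/popular"


def extract_rows(payload):
    """Keep only the title and synopsis ("overview" in TMDB) of each movie."""
    return [
        {"title": movie.get("title", ""), "synopsis": movie.get("overview", "")}
        for movie in payload.get("results", [])
    ]


def fetch_page(api_key, page=1):
    """Download one page of results and reduce it to title/synopsis rows."""
    response = requests.get(
        TMDB_URL, params={"api_key": api_key, "page": page}, timeout=10
    )
    response.raise_for_status()
    return extract_rows(response.json())
```

Calling `fetch_page` for a range of pages and concatenating the rows would give a table with the same two columns described above; genres are deliberately dropped in `extract_rows`.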
The data preprocessing steps applied to the data include:
- Remove missing values: Delete instances with missing values, such as movies lacking any synopsis text.
- Case folding: Transform all letters into lowercase.
- TF-IDF vectorization: Fit the TF-IDF Vectorizer on the preprocessed synopses.
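The three preprocessing steps above can be sketched with Pandas and Scikit-learn roughly as follows. The tiny in-memory DataFrame stands in for data/movie_synopsis.csv, and the column names `title` and `synopsis` as well as the `stop_words="english"` setting are assumptions, not details confirmed by this project.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in for data/movie_synopsis.csv; the column names are assumed.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "synopsis": ["A Detective hunts a KILLER.", None, "Two friends start a band."],
})

# 1. Remove missing values: drop rows that have no synopsis.
df = df.dropna(subset=["synopsis"]).reset_index(drop=True)

# 2. Case folding: lowercase every synopsis.
df["synopsis"] = df["synopsis"].str.lower()

# 3. Fit the TF-IDF vectorizer and turn the synopses into a sparse matrix.
vectorizer = TfidfVectorizer(stop_words="english")  # stop-word removal is an assumption
X = vectorizer.fit_transform(df["synopsis"])

print(X.shape)  # (number of remaining synopses, vocabulary size)
```

The resulting matrix `X` has one row per synopsis and one column per vocabulary term, which is the input format K-Means expects.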
- The model is trained using the K-Means algorithm to cluster the movie synopsis data.
- Training uses the parameter k (number of clusters) set to 14. The collected movies span 19 genres, but 14 is only an initial value; it is evaluated afterwards using the Elbow Method and the Silhouette Score.
- The trained model is saved to models/kmeans_model.sav.
- After training is complete, each data point is labeled with its cluster number, and the labeled data is saved to data/movie_synopsis_labeled.csv.
- Data distribution based on cluster. Note: The distribution is imbalanced at cluster 6.
- Each cluster has its own feature names: the words located at the cluster's centroid, which characterize the genre of that cluster. Ten feature names are selected for each cluster and saved to data/feature_names.csv.
- Word cloud representations of the feature names are provided for each cluster (Cluster 0 through Cluster 13).
- For the next steps, the trained model can be used to:
- Predict the cluster of an entered movie synopsis.
- Recommend movies based on an entered synopsis, by calculating the cosine similarity between the entered synopsis and the synopses of the movies in the database and returning the most similar movies.
- Recommend movies based on an entered title, by looking up the synopsis of that movie and then applying the same cosine similarity step as in the previous point.
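- Elbow Method (SSE): plotted as SSE against the number of clusters.
- Silhouette Score: 0.004184973145629372
- Note: The Silhouette Score ranges from -1 to 1, where a higher score indicates better-defined clusters. A score this close to 0 indicates that the clusters overlap heavily.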
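Both evaluation measures can be computed with Scikit-learn along the following lines. This is a minimal sketch on synthetic 2-D points generated with `make_blobs`; in the project the input would be the TF-IDF matrix of the synopses, and the range of k values tried here is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic points stand in for the TF-IDF matrix (hypothetical data).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

sse = {}
silhouette = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_  # sum of squared distances to the closest centroid
    silhouette[k] = silhouette_score(X, model.labels_)

# Elbow Method: look for the k after which SSE stops dropping sharply.
for k in sse:
    print(k, round(sse[k], 2), round(silhouette[k], 3))

best_k = max(silhouette, key=silhouette.get)
print("k with the highest silhouette score:", best_k)
```

On well-separated data like this, the silhouette score peaks at the true number of groups; on sparse TF-IDF vectors the scores are typically much lower, which is consistent with the value reported above.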
Screenshots: K-Means Model Training, Clustering Result, Feature Names, Data per Clusters, Cluster Prediction (By Synopsis), Cluster Prediction (By Title).
- Web App Movie Synopsis Clustering: http://xxx.xxx
- Perform K-Means model training with an input k (number of clusters)
- Download the K-Means model
- Download the data labeled by the trained K-Means model
- Elbow Method evaluation visualization
- Silhouette Score evaluation
- 2-dimensional PCA (Principal Component Analysis) data visualization
- 3-dimensional PCA (Principal Component Analysis) data visualization
- Feature names list for each cluster
- Word cloud visualization of each cluster's feature names
- Preview and download of the data per cluster
- Cluster prediction + related movie recommendation from input synopsis
- Cluster prediction + related movie recommendation from input movie title
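The 2-dimensional and 3-dimensional PCA views listed above amount to projecting the TF-IDF vectors onto their first principal components. The sketch below shows one way this could be done; the toy synopses are hypothetical, and converting the sparse matrix to a dense array is an assumption that is only practical for small samples.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy synopses standing in for the real dataset (hypothetical data).
synopses = [
    "a space crew fights an alien fleet",
    "an alien fleet attacks a space station",
    "a woman falls in love after a wedding",
    "two strangers find love at a wedding",
]
X = TfidfVectorizer().fit_transform(synopses)

# PCA works on dense input here, so the sparse matrix is converted first.
dense = X.toarray()
points_2d = PCA(n_components=2, random_state=0).fit_transform(dense)
points_3d = PCA(n_components=3, random_state=0).fit_transform(dense)

print(points_2d.shape, points_3d.shape)
```

Each row of `points_2d` or `points_3d` gives the plot coordinates of one synopsis, which can then be colored by cluster label in a scatter plot.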
- Clone this repo
git clone https://github.com/LinggarM/Movie-Synopsis-Text-Clustering
- Open the cloned repo folder on your PC
- Create a virtual environment
python -m venv myenv
- Activate the virtual environment
myenv\Scripts\activate (Windows) or source myenv/bin/activate (Linux)
- Install the requirements/dependencies
pip install -r requirements.txt
- Open CMD in the repository folder
- Run the web app by executing this command:
python app.py
or:
flask run
- Open the given URL
http://127.0.0.1:5000/
- Enter the number of clusters (k) and click "train model" to start the training process. Wait until training has finished; you will then be redirected to the Clustering result page, which shows information about the completed training run.
This project is licensed under the MIT License; see the LICENSE file for details.
- Material Dashboard - v2.1.2 for HTML templates