CinemAI is a one-stop solution created for movie buffs. The full-stack app offers scene-wise and character-wise movie analysis. It also provides genre classification, movie rating prediction, and age restriction prediction based on a movie's script.
The home page offers a wide range of movie options to choose from. Each movie poster can be resized to small, medium, large, or extra-large.
When a movie poster is clicked, an in-depth analysis is generated within seconds. The analysis page is divided into two halves:
- Left Side:
  - Contains the movie script.
  - Includes a metadata button that, when clicked, replaces the movie script with details such as budget, rating, cast, awards, etc.
- Right Side:
  - Showcases various NLP techniques applied to the movie script, represented through interactive visualizations:
    - Unveiling Movie Themes with Topic Modeling: Generates word clouds of prevalent topics in the shape of Melpomene and Thalia Theatre masks.
    - Feel the Vibes: Vader Scene-wise Sentiment: Analyzes the emotional tone of the script on a scene-by-scene basis using sentiment analysis.
    - Spotlight on Characters and Locations with NER: Identifies and classifies entities such as characters, locations, and organizations within the script using Named Entity Recognition (NER).
    - A Character's Stage: Scenes per Character: Shows the frequency of each character's appearances throughout the movie.
    - Words That Matter: Dialogue per Character: Highlights how often each character speaks in the movie.
    - Emotion Unveiled: Scene-wise NRC Lexicon Analysis: Provides an emotional landscape of the script with NRC Lexicon Analysis, covering emotions like anger, anticipation, disgust, fear, joy, positive, negative, sadness, surprise, and trust.
    - Decoding Language: POS Tags: Analyzes the parts of speech used in the script.
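The "Scenes per Character" and "Dialogue per Character" counts can be illustrated with a minimal sketch. It assumes a conventionally formatted screenplay in which scene headings start with `INT.` or `EXT.` and speaker cues are short all-caps lines; the function name `character_stats` and both regular expressions are hypothetical and are not taken from the CinemAI code, which may parse scripts differently.

```python
import re
from collections import Counter

# Assumed screenplay conventions (not taken from the CinemAI code base):
# scene headings begin with INT. or EXT., and a speaker cue is a short
# all-caps line. Both patterns are heuristics and can misfire on
# all-caps action lines.
SCENE_HEADING = re.compile(r"^(INT\.|EXT\.)")
SPEAKER_CUE = re.compile(r"^[A-Z][A-Z .'\-]{1,30}$")


def character_stats(script: str):
    """Return (scenes_per_character, dialogue_blocks_per_character)."""
    scenes = Counter()
    dialogue = Counter()
    seen_in_scene = set()
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue
        if SCENE_HEADING.match(line):
            # A new scene starts: forget who has already spoken.
            seen_in_scene = set()
            continue
        if SPEAKER_CUE.match(line):
            dialogue[line] += 1
            if line not in seen_in_scene:
                # Count each character at most once per scene.
                seen_in_scene.add(line)
                scenes[line] += 1
    return scenes, dialogue
```

Calling `character_stats(script_text)` returns two `Counter` objects whose `most_common()` output could feed a bar chart per character.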
There are three distinct BERT (Bidirectional Encoder Representations from Transformers) based applications that have been fine-tuned to predict the genre (multi-label), age restriction (numeric value), and IMDb rating (numeric value) based on movie scripts. Users can either type in a script in an empty box or autofill the box with a selection of movies from the dropdown menu.
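Multi-label genre prediction differs from the two numeric predictions in how the model output is decoded: each genre is scored independently, so a script can receive several genres at once. The sketch below shows one common way to do this, assuming the genre model emits one logit per genre. The label list, the 0.5 threshold, and the function names are hypothetical; the source does not state how CinemAI decodes its outputs.

```python
import math

# Hypothetical label set; the actual genre list used by CinemAI is not stated.
GENRES = ["Action", "Comedy", "Drama", "Thriller"]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_genres(logits, labels=GENRES, threshold=0.5):
    """Return every label whose independent sigmoid probability meets the threshold."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    return [label for label, z in zip(labels, logits) if sigmoid(z) >= threshold]
```

For example, `decode_genres([2.0, -1.5, 0.3, -0.2])` keeps the first and third labels, because only those two logits map to a probability of at least 0.5.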
- Downloading additional (large) files
  - Download the .H5 files, pre-trained tokenizers, and NRC lexicon into the Models folder from the repository link given in README.md of the Models folder.
  - Download the Raw folder, Processed folder, and cinema_mask.png into the Data folder from the repository link given in README.md of the Data folder.
  - Extract the compressed file movie poster images.zip present in App/static/images.
- Install the required libraries from requirements.txt (Command -> pip install -r requirements.txt)
- Run app.py in the App folder as a Flask server
Note: Kindly email [email protected] to request access to the above links, providing a valid reason.
There are two Jupyter Notebooks (.ipynb) in the Notebooks folder.
- The Movie Analysis - onClick Graph Rendering.ipynb notebook provides comprehensive tools, in the form of class instances and functions, for analyzing movie scripts, with a worked example focused on the "X-Men" movie.
- The Model Training - BERT text classification & prediction.ipynb notebook fine-tunes the BERT and RoBERTa models and demonstrates their use with examples. The primary tasks are predicting IMDb user ratings, classifying genres, and predicting age restrictions based on movie scripts. For demonstration purposes, the notebook includes a wide variety of explanatory paragraphs and examples drawn from movies, Reddit posts, and songs.
This repository also presents OfficeGPT, a fine-tuned implementation of DialoGPT specifically designed to generate dialogue in the style of characters from the popular American television series "The Office." By leveraging existing resources, the model has been trained to capture the unique speech patterns and conversational dynamics of the show's characters.
Training Methodology: OfficeGPT employs a fine-tuning approach based on the DialoGPT architecture. The model was iteratively trained on a dataset of approximately 45,000 dialogue lines spanning all nine seasons of "The Office." This dataset was carefully curated to represent the speech of various characters, ensuring a comprehensive understanding of the show's dialogue style.
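A common way to prepare dialogue lines for DialoGPT fine-tuning is to join a few preceding turns with the response, ending every turn with the end-of-text token. The sketch below illustrates that idea only; the function name `build_examples` and the `n_context` parameter are hypothetical, and the preprocessing in the OfficeGPT notebook may differ.

```python
EOS = "<|endoftext|>"  # end-of-text token of the GPT-2 tokenizer that DialoGPT uses


def build_examples(lines, n_context=3):
    """Turn consecutive dialogue lines into EOS-joined training strings.

    Each example holds up to `n_context` previous lines followed by the
    response line, with every turn terminated by EOS.
    """
    examples = []
    for i in range(1, len(lines)):
        start = max(0, i - n_context)
        turns = lines[start:i + 1]
        examples.append("".join(turn.strip() + EOS for turn in turns))
    return examples
```

With `n_context=1`, each training string contains exactly one prompt turn and one response turn; larger values give the model more conversational history per example at the cost of longer sequences.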
Implementation Details: A Jupyter Notebook (.ipynb) file, approximately 75 MB in size, is included in this repository. This notebook outlines the training process and is customized to work with a separate CSV script containing the dialogue dataset. The script can be downloaded from the following [public repository]([url](https://www.kaggle.com/datasets/nasirkhalid24/the-office-us-
Model Size and Availability: Due to the significant size (over 9 GB) of the trained OfficeGPT model, it has not been included in this repository for practical reasons. However, the provided Jupyter Notebook offers a comprehensive guide for those interested in replicating the training process and generating their own "Office"-inspired dialogues.