In this project, I collected, cleaned, pre-processed, and augmented data to build a high-quality dialogue dataset covering a wide distribution of topics. The goal was to fine-tune large-scale language models on dialogue-based conversations and use the best-performing model to make real-time reply suggestions in a chat app, a feature similar to Google's Smart Compose.
As of now, the dataset has only been used to fine-tune Google's T5 model. This was the final TensorBoard for the project.
Here's a copy of the final graduation project presentation. I was responsible for delivering the entire data science part of the project.
Create a virtual environment:
python -m venv chatbot-env
Activate it:
chatbot-env\Scripts\activate.bat   # In CMD
chatbot-env\Scripts\Activate.ps1   # In PowerShell
source chatbot-env/bin/activate    # On Linux/macOS
Install the requirements:
python -m pip install -r requirements.txt
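If you want to confirm the environment is actually active before installing anything into it, a minimal check (an illustrative sketch, not part of the project itself) is that inside a virtual environment `sys.prefix` points at the env rather than the base interpreter:

```python
import sys

# Inside an active venv, sys.prefix points at the environment directory,
# while sys.base_prefix still points at the base Python installation.
in_venv = sys.prefix != sys.base_prefix
print("virtual environment active:", in_venv)
```

If this prints `False`, the `pip install` below would go into your global site-packages instead of `chatbot-env`.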
Notes:
- Currently, I run the project locally on Python 3.9. However, I might consider downgrading to 3.7 to match the current Python version on Google Colab.
- I use PyTorch with CUDA v11.3 (the same version currently on Google Colab). If you have a different CUDA version installed, you might need to update the requirements file. Have a look at this.