
ViXlate

Video Translation using Deep Learning

CS 59000 - Application of Deep Learning - Course Project

Technologies Used

Python, Keras, TensorFlow, Flask, JavaScript, HTML5, CSS3, OpenAI, GitHub

Overview

As educational, informational, and entertainment videos become essential tools for learning and communication, language barriers can limit their reach and impact. A video translation system enables content to be understood by a broader audience, regardless of language, promoting inclusivity and ensuring that knowledge and ideas can be shared across cultures. AI-driven translation systems can process vast amounts of video content quickly and efficiently, significantly reducing the time and cost associated with manual translation. By leveraging advanced machine learning models, these systems achieve high levels of accuracy and contextual awareness, preserving the nuances and intent of the original content.

This project focuses on developing an automated system that uses deep learning techniques to translate videos from one language to another. The process involves extracting audio from the video, performing speech recognition to convert spoken words into text, translating that text into the desired language, generating new audio from the translation, and then seamlessly merging it with the original video.
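
As a rough illustration of this pipeline (using the tools settled on in the Results section below), a minimal sketch could look like the following. It is not the repository's exact code: it assumes the openai-whisper and Coqui TTS Python packages and the ffmpeg command-line tool are installed, and the model names, languages, and file paths are placeholders.

```python
# Minimal end-to-end sketch of the pipeline (illustrative only; assumes the
# openai-whisper and Coqui TTS packages and the ffmpeg CLI are installed;
# model names, languages, and file paths are examples).
import subprocess

import whisper
from TTS.api import TTS

VIDEO_IN, VIDEO_OUT = "input.mp4", "translated.mp4"

# 1. Speech recognition and translation in one pass. Whisper loads audio via
#    ffmpeg, so a video path works directly; its "translate" task outputs English.
asr_model = whisper.load_model("medium")
result = asr_model.transcribe(VIDEO_IN, task="translate")
translated_text = result["text"]

# 2. Text-to-speech for the translated text, with basic voice cloning from a
#    short reference clip of the original speaker.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(text=translated_text, language="en",
                speaker_wav="reference_voice.wav", file_path="dubbed.wav")

# 3. Replace the original audio track with the generated speech.
subprocess.run([
    "ffmpeg", "-y", "-i", VIDEO_IN, "-i", "dubbed.wav",
    "-map", "0:v:0", "-map", "1:a:0", "-c:v", "copy", "-shortest", VIDEO_OUT,
], check=True)
```

Note that Whisper's built-in translate task targets English; translating into other languages uses a separate machine translation step, as described in the Results section below.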

Features

  • Support for YouTube URLs and direct video file uploads
  • Speech recognition powered by OpenAI Whisper
  • Multiple machine translation options, including Google Translate, Azure AI Translator, and MyMemory
  • Text-to-speech generation using Bark by Suno
  • Basic voice cloning capabilities
  • Seamless audio-video synchronization
  • Intuitive and user-friendly web interface

(back to top)

Models Used

  • Speech Recognition: OpenAI Whisper
  • Machine Translation: Google Translate, Azure AI Translator, MyMemory
  • Text-to-Speech: Bark by Suno

(back to top)

Methodology

[Figure: Workflow of the video translation application.]

(back to top)

Results

During the initiation and requirements phase, we conducted thorough research into existing technologies and developed a methodology aligned with our timeline and objectives. We designed a modular workflow and divided the entire process into two phases, each reviewed separately to ensure progress and alignment with our goals.

Mid-Sem Review (Code)

For the first review, our main goal was to establish a strong foundation and make initial progress in each step of our process. We divided the work into four sub-tasks and aimed to advance each one. By the time of the review, we had made solid progress on all of the sub-tasks, setting the stage for further development; illustrative sketches of Steps 02-04 appear after the list below.

  • Step 01 - Extracting the text from the video

    For a given video, we tested several open-source deep learning models to compare the time it took to extract text from the video and the accuracy of the extracted text.

    [Figure: Execution time for the initial steps and text extraction using different models.]
  • Step 02 - Machine Translation

    Similar to the first step, we tested multiple models, including Google Translate and MyMemory, and compared their execution times and translation accuracy.

    [Figure: Execution time for the machine translation using different models.]
  • Step 03 - Text-to-Speech

    For the text-to-speech step, we utilized Suno's Bark model (https://github.com/suno-ai/bark), which employs transformers for voice generation and supports 13 languages. Given that a video can contain lengthy speech, we also experimented with using NLTK to split the extracted text into individual sentences. Each sentence was processed separately and then merged into a single .wav file as the final output.

  • Step 04 - Combining audio and video into the outcome

    We used the MoviePy Python library to merge the newly generated audio with the existing video, first removing the original audio from the video.
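
The following sketches give a rough, illustrative picture of the mid-semester versions of Steps 02-04. They are approximations rather than the repository's exact code; the library choices, file names, and sample text are assumptions.

A possible client for Step 02, assuming the deep-translator package as one way to call Google Translate (MyMemory is exposed through a similar interface in the same package):

```python
# Step 02 sketch: machine translation via the deep-translator package,
# used here as one possible Google Translate client (an assumption).
from deep_translator import GoogleTranslator

extracted_text = "Hola y bienvenidos a este video."
translated = GoogleTranslator(source="auto", target="en").translate(extracted_text)
print(translated)
```

For Step 03, a sketch of the sentence-splitting approach with NLTK and Bark; the input text and output path are placeholders:

```python
# Step 03 sketch: split the translated text into sentences with NLTK, generate
# speech per sentence with Bark, and concatenate everything into one .wav file.
import numpy as np
import nltk
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE, generate_audio, preload_models

nltk.download("punkt", quiet=True)  # newer NLTK releases may also need "punkt_tab"
preload_models()                    # downloads Bark's weights on first use

translated_text = "This is the translated narration. It can span many sentences."
sentences = nltk.sent_tokenize(translated_text)

silence = np.zeros(int(0.25 * SAMPLE_RATE), dtype=np.float32)  # short pause between sentences
pieces = []
for sentence in sentences:
    pieces.append(generate_audio(sentence))
    pieces.append(silence)

write_wav("translated_audio.wav", SAMPLE_RATE, np.concatenate(pieces))
```

And for Step 04, merging with MoviePy (API shown as in MoviePy 1.x):

```python
# Step 04 sketch: strip the original audio track and attach the generated
# speech with MoviePy.
from moviepy.editor import AudioFileClip, VideoFileClip

video = VideoFileClip("input.mp4").without_audio()
dubbed_audio = AudioFileClip("translated_audio.wav")
video.set_audio(dubbed_audio).write_videofile("translated.mp4")
```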

End-Sem Review (Code)

Since the mid-semester review, our focus has been on improving the efficiency of the entire process and reducing the time required to translate any video. For the end-semester review, we concentrated on implementing more optimized algorithms at each step to achieve these goals.

  • Step 02 - Machine Translation

    We utilized OpenAI Whisper to combine Step 01 and Step 02 into a single, streamlined step. With Whisper, we were able to transcribe and translate the video simultaneously, enhancing efficiency and reducing processing time.

    [Figure: Execution time for the machine translation using OpenAI Whisper.]
  • Step 03 - Text-to-Speech

    For text-to-speech, we found that the Suno Bark model was slow at generating speech and its audio capabilities were limited for our needs. To address this, we experimented with Coqui TTS, which supports more languages and delivers more fluent speech generation overall. Additionally, Coqui TTS includes basic voice cloning capabilities, which align well with our project's needs.

  • Step 04 - Combining audio and video into the outcome

    For audio and video merging, we switched to the FFmpeg utility, which offers greater flexibility and speed than the MoviePy library.

  • Flask app

    We also developed a Flask application to simplify user interaction by hiding the complexities of the underlying system.
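
The Whisper, Coqui TTS, and FFmpeg pieces above correspond to the calls sketched in the Overview. For the Flask front end, a minimal, hypothetical outline is shown below; the route, form fields, and the translate_video() placeholder stand in for the actual application code.

```python
# Minimal Flask sketch (illustrative only): accepts an uploaded video file and
# hands it to the translation pipeline. The route, form fields, and the
# translate_video() placeholder are hypothetical, not this repository's API.
import os
from flask import Flask, request, send_file, render_template_string
from werkzeug.utils import secure_filename

app = Flask(__name__)
UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)

FORM = """
<form method="post" enctype="multipart/form-data">
  <input type="file" name="video">
  <input type="text" name="youtube_url" placeholder="or paste a YouTube URL">
  <button type="submit">Translate</button>
</form>
"""

def translate_video(video_path: str) -> str:
    """Placeholder for the Whisper -> TTS -> FFmpeg pipeline sketched above."""
    raise NotImplementedError

@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        uploaded = request.files.get("video")
        if uploaded and uploaded.filename:
            src = os.path.join(UPLOAD_DIR, secure_filename(uploaded.filename))
            uploaded.save(src)
        else:
            # A YouTube URL from the form would be downloaded first (for
            # example with a downloader such as yt-dlp) and then passed
            # through the same pipeline.
            return "No video provided", 400
        output_path = translate_video(src)  # run the pipeline
        return send_file(output_path, as_attachment=True)
    return render_template_string(FORM)

if __name__ == "__main__":
    app.run(debug=True)
```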

(back to top)

Team

Acknowledgments

  • OpenAI for the Whisper model
  • Google, Microsoft Azure, and MyMemory for translation services
  • Suno for the Bark text-to-speech model

(back to top)
