
ViXlate

Video Translation using Deep Learning

CS 59000 - Application of Deep Learning - Course Project

Technologies Used

Python, Keras, TensorFlow, Flask, JavaScript, HTML5, CSS3, OpenAI, GitHub

Overview

As educational, informational, and entertainment videos become essential tools for learning and communication, language barriers can limit their reach and impact. A video translation system enables content to be understood by a broader audience, regardless of language, promoting inclusivity and ensuring that knowledge and ideas can be shared across cultures. AI-driven translation systems can process vast amounts of video content quickly and efficiently, significantly reducing the time and cost associated with manual translation. By leveraging advanced machine learning models, these systems achieve high levels of accuracy and contextual awareness, preserving the nuances and intent of the original content.

This project focuses on developing an automated system that uses deep learning techniques to translate videos from one language to another. The process involves extracting audio from the video, performing speech recognition to convert spoken words into text, translating that text into the desired language, generating new audio from the translation, and then seamlessly merging it with the original video.

Features

  • Support for YouTube URLs and direct video file uploads
  • Speech recognition powered by OpenAI Whisper
  • Multiple machine translation options, including Google Translate, Azure AI Translator, and MyMemory
  • Text-to-speech generation using Bark by Suno
  • Basic voice cloning capabilities
  • Seamless audio-video synchronization
  • Intuitive and user-friendly web interface

(back to top)

Models Used

  • Speech Recognition: OpenAI Whisper
  • Machine Translation: Google Translate, Azure AI Translator, MyMemory
  • Text-to-Speech: Bark by Suno

(back to top)

Methodology

Figure: Workflow of the video translation application.

(back to top)

Results

During the initiation and requirements phase, we conducted thorough research into existing technologies and developed a methodology aligned with our timeline and objectives. We designed a modular workflow and divided the entire process into two phases, each reviewed separately to ensure progress and alignment with our goals.

Mid-Sem Review (Code)

For the first review, our main goal was to establish a strong foundation and make initial progress in each step of our process. We divided the work into four sub-tasks and aimed to advance each one. By the time of the review, we had made solid progress in all the sub-tasks, setting the stage for further development. Illustrative code sketches for the four steps follow the list below.

  • Step 01 - Extracting the text from the video

    For a given video, we tested several open-source deep learning models to compare the time it took to extract text from the video and the accuracy of the extracted text.

    Figure: Execution time for the initial steps and text extraction using different models.
  • Step 02 - Machine Translation

    Similar to the first step, we tested multiple models, including Google Translate and MyMemory, and compared their execution times and translation accuracy.

    Figure: Execution time for the machine translation using different models.
  • Step 03 - Text-to-Speech

    For the text-to-speech step, we utilized Suno's Bark model (https://github.com/suno-ai/bark), which employs transformers for voice generation and supports 13 languages. Given that a video can contain lengthy speech, we also experimented with using NLTK to split the extracted text into individual sentences. Each sentence was processed separately and then merged into a single .wav file as the final output.

  • Step 04 - Combining audio and video into the outcome

    We used the MoviePy Python library to merge the newly generated audio with the existing video, first removing the original audio from the video.
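
The sketches below show one way each of the four mid-semester steps could be wired up with the libraries named above; file names, model sizes, and language choices are illustrative rather than the project's exact configuration.

Step 01 can be sketched as audio extraction with MoviePy followed by Whisper speech recognition:

```python
# Minimal sketch of Step 01, assuming the openai-whisper and moviepy (1.x) packages.
import whisper
from moviepy.editor import VideoFileClip

def extract_audio(video_path: str, audio_path: str = "audio.wav") -> str:
    """Write the video's audio track to a standalone WAV file."""
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path)
    clip.close()
    return audio_path

def transcribe(audio_path: str, model_size: str = "base") -> str:
    """Run Whisper speech recognition and return the recognized text."""
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    return result["text"]

if __name__ == "__main__":
    print(transcribe(extract_audio("input_video.mp4")))
```

Different recognition models, or different Whisper sizes, can be swapped into `transcribe` to reproduce the kind of timing comparison shown in the figure for Step 01.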
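
Step 02's backend comparison can be reproduced by timing each translation call. The deep-translator package is used here as one possible client for Google Translate (the README does not name the client library the project used); MyMemoryTranslator from the same package can be timed the same way, though its expected language identifiers differ between versions:

```python
# Minimal sketch of Step 02: time a machine translation backend on sample text.
import time
from deep_translator import GoogleTranslator

def timed_translate(translator, text: str):
    """Translate `text` and report how long the call took."""
    start = time.perf_counter()
    translated = translator.translate(text)
    return translated, time.perf_counter() - start

if __name__ == "__main__":
    sample = "Deep learning makes automatic video translation practical."
    backend = GoogleTranslator(source="en", target="es")
    translated, seconds = timed_translate(backend, sample)
    print(f"Google Translate: {seconds:.2f}s -> {translated}")
```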
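
Step 03 splits the translated text into sentences with NLTK, synthesizes each sentence with Bark, and concatenates the clips into a single WAV, roughly as described above; the speaker preset is one of Bark's published voice presets and is illustrative:

```python
# Minimal sketch of Step 03 using suno-bark, nltk, numpy, and scipy.
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
from scipy.io.wavfile import write as write_wav
from bark import SAMPLE_RATE, generate_audio, preload_models

nltk.download("punkt", quiet=True)   # sentence tokenizer data
preload_models()                     # download and cache the Bark model weights

def synthesize(text: str, out_path: str = "speech.wav") -> str:
    """Generate speech sentence by sentence and join the pieces with short pauses."""
    silence = np.zeros(int(0.25 * SAMPLE_RATE))  # 250 ms gap between sentences
    pieces = []
    for sentence in sent_tokenize(text):
        audio = generate_audio(sentence, history_prompt="v2/es_speaker_1")
        pieces += [audio, silence.copy()]
    write_wav(out_path, SAMPLE_RATE, np.concatenate(pieces))
    return out_path
```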
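
Step 04 strips the original soundtrack and attaches the generated speech, here with the MoviePy 1.x API named in the step above:

```python
# Minimal sketch of Step 04: replace a video's audio track with the generated speech.
from moviepy.editor import VideoFileClip, AudioFileClip

def merge_audio_video(video_path: str, audio_path: str,
                      out_path: str = "translated.mp4") -> str:
    """Remove the original audio and attach the generated track."""
    video = VideoFileClip(video_path).without_audio()
    dubbed = AudioFileClip(audio_path)
    video.set_audio(dubbed).write_videofile(out_path, audio_codec="aac")
    return out_path
```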

End-Sem Review (Code)

Since the mid-semester review, our focus has been on improving the efficiency of the entire process and reducing the time required to translate any video. For the end-semester review, we concentrated on implementing more optimized algorithms at each step to achieve these goals. Code sketches for the revised steps follow the list below.

  • Step 02 - Machine Translation

    We utilized OpenAI Whisper to combine Step 01 and Step 02 into a single, streamlined step. With Whisper, we were able to transcribe and translate the video simultaneously, enhancing efficiency and reducing processing time.

    Figure: Execution time for the machine translation using OpenAI Whisper.
  • Step 03 - Text-to-Speech

    For text-to-speech, we found that the Suno Bark model was slow at generating speech and that its audio quality fell short of our requirements. To address this, we experimented with CoquiTTS, which supports more languages and delivers more fluent speech overall. Additionally, CoquiTTS includes basic voice cloning capabilities, which align well with our project's needs.

  • Step 04 - Combining audio and video into the outcome

    For audio and video merging, we switched to the FFmpeg utility, which offers greater flexibility and speed than the MoviePy library.

  • Flask app

    We also developed a Flask application to simplify user interaction by hiding the complexities of the underlying system.
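
The sketches below mirror the end-semester revisions; as before, file names and parameters are illustrative. For the combined Steps 01-02, Whisper's built-in translate task transcribes and translates in a single pass (note that this task targets English, so other target languages still need a separate translation step):

```python
# Minimal sketch of the combined Steps 01-02 with openai-whisper.
import whisper

def transcribe_and_translate(audio_path: str, model_size: str = "base") -> str:
    """Transcribe speech and translate it to English in one Whisper pass."""
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path, task="translate")  # task="translate" -> English text
    return result["text"]
```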
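
The revised Step 03 with CoquiTTS could look like the following; the XTTS v2 model shown here supports multilingual synthesis and basic voice cloning from a short reference clip, though the README does not state which Coqui model the project settled on:

```python
# Minimal sketch of the revised Step 03 using the Coqui TTS package.
from TTS.api import TTS

def synthesize_cloned(text: str, reference_wav: str, language: str = "es",
                      out_path: str = "speech.wav") -> str:
    """Synthesize speech in `language`, cloning the voice in `reference_wav`."""
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(text=text, speaker_wav=reference_wav,
                    language=language, file_path=out_path)
    return out_path
```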
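
The revised Step 04 invokes the ffmpeg CLI directly, copying the video stream unchanged and replacing only the audio track:

```python
# Minimal sketch of the revised Step 04: swap the audio track with ffmpeg.
import subprocess

def merge_with_ffmpeg(video_path: str, audio_path: str,
                      out_path: str = "translated.mp4") -> None:
    """Keep the video stream as-is and attach the generated audio."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path, "-i", audio_path,
        "-map", "0:v:0", "-map", "1:a:0",   # video from input 0, audio from input 1
        "-c:v", "copy", "-c:a", "aac",      # copy video, encode new audio as AAC
        "-shortest", out_path,
    ], check=True)
```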
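
Finally, a minimal sketch of how a Flask front end could wrap the pipeline; the route names, form fields, template, and translate_video helper are placeholders rather than the project's actual code:

```python
# Minimal sketch of a Flask app that accepts an upload and returns the translated video.
from flask import Flask, request, render_template, send_file

app = Flask(__name__)

def translate_video(video_path: str, target_lang: str) -> str:
    """Placeholder for the full pipeline: recognition, translation, TTS, merging."""
    return "translated.mp4"

@app.route("/")
def index():
    # templates/index.html would hold the YouTube-URL / file-upload form
    return render_template("index.html")

@app.route("/translate", methods=["POST"])
def translate():
    upload = request.files["video"]                    # direct video file upload
    upload.save("input_video.mp4")
    target_lang = request.form.get("language", "es")
    return send_file(translate_video("input_video.mp4", target_lang), as_attachment=True)

if __name__ == "__main__":
    app.run(debug=True)
```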

(back to top)

Team

Acknowledgments

  • OpenAI for the Whisper model
  • Google, Microsoft Azure, and MyMemory for translation services
  • Suno for the Bark text-to-speech model

(back to top)