With advancements in the multimedia sector, abundant videos are being generated for a wide range of applications. Video summarization has gained impetus because it extracts the important parts of a video while conserving its information. Moreover, understanding video content and generating a summary of its events can help automate video description, which is already used on several online platforms. In this project, a Transformer-based video-to-text generator is proposed that can generate a concise summary for videos containing several scene changes. The proposed pipeline involves identifying keyframes in the input videos, extracting image features and generating captions, and finally summarizing the concatenated captions. Evaluation is done on a custom dataset; the average BLEU-4 and ROUGE F-measures obtained are 50.69% and 37.78%, respectively.
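For reference, BLEU-4 is a modified n-gram precision over orders 1-4 with a brevity penalty. A minimal sentence-level sketch in plain Python is shown below; the whitespace tokenization and add-one smoothing here are illustrative assumptions, not the evaluation code actually used for the reported scores.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Simplified sentence-level BLEU-4 against a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, 5):
        c_ngrams = ngrams(cand, n)
        r_ngrams = ngrams(ref, n)
        # clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference
        overlap = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        total = max(sum(c_ngrams.values()), 1)
        # add-one smoothing so one empty n-gram order does not zero the score
        log_prec += math.log((overlap + 1) / (total + 1))
    # brevity penalty discourages overly short candidates
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(log_prec / 4)
```

An exact self-match scores 1.0; a partial overlap falls strictly between 0 and 1.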
The implementation is divided into 3 main components:
* Identification and extraction of keyframes from the input videos.
* Generation of captions from the extracted images.
* Concatenation and summarization of the captions.

The entire flow of the methodology is illustrated in the figure below.
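The first component can be sketched with a simple frame-differencing heuristic: keep a frame as a keyframe whenever it differs enough from the previous keyframe. This is a minimal illustration in plain Python (frames as flat pixel lists); the `threshold` value and the mean-absolute-difference criterion are assumptions for the sketch, not necessarily the method used in the project.

```python
def mean_abs_diff(a, b):
    """Mean absolute per-pixel difference between two equal-length frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def select_keyframes(frames, threshold=30.0):
    """Return indices of frames that differ enough from the last keyframe.

    The first frame is always kept; each later frame is compared against
    the most recently selected keyframe, so a scene change triggers a new
    keyframe while near-duplicate frames are skipped.
    """
    keyframes = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[keyframes[-1]]) > threshold:
            keyframes.append(i)
    return keyframes
```

For a clip of two dark frames followed by two bright frames, this keeps only the first frame of each scene. In practice, libraries such as OpenCV would supply the frame decoding and difference computation.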
Repository navigation:
- / - Colab code file
- videos/ - contains the video dataset used for testing
- my_transformer_weights/ - saved Transformer weights
- images/ - contains the project flow diagram and result images
- Harsh More
- Dhruva Khanwelkar
- Chirag Vaswani
- Juhi Rajani
- Nirvisha Soni