
[DMP 2024]: Clustering large amount of videos #81

Closed
6 tasks
dennyabrain opened this issue Feb 16, 2024 · 16 comments

@dennyabrain
Contributor

dennyabrain commented Feb 16, 2024

Ticket Contents

Description

Feluda allows researchers, fact-checkers and journalists to explore and analyze large quantities of multimedia content. One important modality on Indian social media is video. The scope of this task is to explore various automated techniques suited to this task and, after consultation with the team, implement an end-to-end workflow that can be used to surface visual or temporal trends in a large collection of videos.

Goals

  • Review literature with our team and do research and prototyping to evaluate state-of-the-art ML and classical DSP techniques
  • Optimize the solution for consistent RAM and CPU usage (limit the spikes caused by variables like file size, video length etc.), since it will need to scale up to millions of videos.
  • Integrate the solution into Feluda by creating an operator that adheres to the Feluda operator interface
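
The last goal above references the Feluda operator interface. As a rough illustration of the shape involved (the module-level `initialize()`/`run()` pattern follows existing operators such as `vid_vec_rep_resnet.py`, but the toy model and payload below are invented placeholders, not the real contract):

```python
# Sketch of a Feluda-style operator module: existing operators expose
# module-level initialize() and run() functions. Everything below is a
# placeholder to show the shape, not a real implementation.

_model = None

def initialize(parameters: dict) -> None:
    """Called once at worker startup; load models/config here so run() stays cheap."""
    global _model
    # placeholder "model": just remember the embedding dimension
    _model = {"dim": parameters.get("dim", 4)}

def run(video_path: str) -> dict:
    """Process one media file and return structured, serializable output."""
    if _model is None:
        raise RuntimeError("call initialize() before run()")
    # a real operator would decode frames and embed them here
    return {"file": video_path, "vid_vec": [0.0] * _model["dim"]}

initialize({"dim": 3})
result = run("sample.mp4")
```

Keeping model loading in `initialize()` means a worker pays the startup cost once and `run()` can be scheduled cheaply per file.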

Expected Outcome

Feluda's goal is to provide a simple CLI or scriptable interface for analysing multimodal social media data. In that vein, all the work that you do should be executable and configurable via scripts and config files. The solution should take Feluda's architecture and its various components into account to identify the best ways to enable this.
The solution should have a way to configure the data source (a database with file IDs or an S3 bucket with files), specify and implement the data processing pipeline, and decide where the result will be stored. Our current implementation uses S3 and a SQL database as data sources and Elasticsearch for storing results, but additional sources or stores can be added if apt for this project.
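
To make that concrete, here is a hypothetical sketch of what such a config could declare. Every field name is illustrative only, invented for this sketch; it is not Feluda's actual config schema:

```python
# Illustrative config shape: data source, processing pipeline, and result
# store declared in one place so a run is fully scriptable. All field
# names are invented for this sketch and do not match Feluda's real config.
config = {
    "source": {"type": "s3", "bucket": "videos", "prefix": "incoming/"},
    "pipeline": [
        {"operator": "vid_vec_rep", "params": {"model": "resnet"}},
        {"operator": "cluster_embeddings", "params": {"n_clusters": 10}},
    ],
    "store": {"type": "elasticsearch", "index": "video-clusters"},
}
```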

Acceptance Criteria

  • Regular interactive demos with the team using a public Jupyter notebook pushed to our experiments repository
  • A working Feluda operator with tests that can be run as an independent worker in the cloud to schedule processing jobs over a large dataset
  • Structured output data that can be passed on to a UI service (web or mobile) for downstream use cases

Implementation Details

One way we have approached this is by using vector embeddings. We have done this to great success to surface visual trends in images. We used a ResNet model to generate vector embeddings and stored them in Elasticsearch. We also used t-SNE to reduce the dimensions of the vector embeddings so they can be displayed in a 2D visualization. It can be viewed here
A detailed report on Feluda's usage in a project to analyze images can be read here
The relevant Feluda operator can be studied here
The code for t-SNE is here
A prior study of various ways to get insights out of images has been documented here
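
The embed-then-reduce step described above can be sketched with scikit-learn; the random matrix below is just a stand-in for the ResNet embeddings that would come out of the model:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for ResNet vector embeddings: 100 items x 512 dimensions
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 512)).astype(np.float32)

# Reduce to 2-D for plotting; perplexity must stay below n_samples
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
# coords has shape (100, 2) and can be scatter-plotted directly
```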

Mockups/Wireframes

This is an interactive visualization of Image clustering done using Feluda.
Doing UI development or integrating with any UI software is not part of this project but it might help to see what sort of downstream applications we use Feluda for.

Product Name

Feluda

Organisation Name

Tattle

Domain

Open Source Library

Tech Skills Needed

Computer Vision, Docker, Machine Learning, Performance Improvement, Python

Mentor(s)

@dennyabrain @duggalsu

Category

Data Science, Machine Learning

@Sayanjones

Hey @dennyabrain, I'm Sayan, and I'm interested in contributing to the video analysis project! My skills in computer vision, machine learning, and Python are a great fit. I'm eager to explore video analysis using techniques like vector embeddings.

Proficient in Docker and performance optimization, I can ensure the solution scales efficiently. I value open-source development and look forward to contributing demos.

Is there a way you prefer for me to reach out? I'm looking forward to exploring how I can contribute.

@dennyabrain
Contributor Author

Hi @Sayanjones we can use this issue to communicate approaches. If you start concretely implementing something, you can make a new issue specific to your approach and we can take the conversation there.

@Ris-code

Hi @dennyabrain

I'm Rishav Aich, pursuing my BTech in artificial intelligence and data science from IIT Jodhpur. Being a student of AI, I have done courses on deep learning, machine learning, and AI. I am proficient in C++, Python, and R programming languages. I have a strong background in development, more specifically, backend development. I have used Docker in various projects.

This project completely aligns with my skills. It would be great to contribute to this.

Please advise me on how to get started with the project.

@AbhimanyuSamagra

Do not ask process-related questions about how to apply and who to contact in the above ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, you can refer to the instructions listed on Unstop, and any further queries can be taken up on our Discord channel titled DMP queries.

@Aryankb

Aryankb commented Apr 15, 2024

Hey @dennyabrain, this is Aryan from IIIT Naya Raipur. I am currently pursuing my B.Tech in Data Science and Artificial Intelligence. I have good experience in deep learning, computer vision, and NLP. I've worked on several projects, such as self-driving cars using camera input. I am really excited to work on this project as I feel this is a perfect match for me. Also, I am going to learn Docker in the future.

@dennyabrain
Contributor Author

Hi everyone,

Thank you for expressing interest in this issue. Depending on your interests and skills, you can take ANY ONE of the following approaches:

  1. Look at the problem statement and propose your approach
    Remember the main problem statement - Given a large number of video files, find a way to group identical and similar video files. This approach would be ideal for anyone who is interested in or studies ML and/or DSP. By thinking about the problem statement, reviewing existing literature on it and proposing your approach here, we would all learn something from it and the mentors should be able to nudge you in the right direction.

  2. Try getting Feluda working on your machine
    Feluda is a moderately complex piece of software and has many moving parts. Getting it working on your machine can itself be a challenge. We have a guide on it [here](https://github.com/tattle-made/feluda/wiki/Setup-Feluda-Locally). If you are a software developer/tinkerer, this might be a good place to start, because once you have Feluda working locally and can see the various existing functionalities, that might give you an idea of how to proceed.

  3. Recreate our code in a Jupyter notebook or Google Colab notebook
    We already have some code that takes [video files and converts them into vectors](https://github.com/tattle-made/feluda/blob/main/src/core/operators/vid_vec_rep_resnet.py). We also have code that takes these vectors and [clusters them](https://github.com/tattle-made/data-experiments/blob/master/tSNE-clustering.ipynb). I would take this approach if you are a software engineer with some ML engineering skills and you know your way around using ML models. Once you get this working in your notebook we can try out different pretrained models to evaluate performance.

You'll have me or members from our team to guide you if you get stuck on any of these approaches. Taking some concrete steps on any of these three approaches would help us know what your interests and skills are and let us give you concrete feedback when you get stuck.

All the best!

@AbhimanyuSamagra

Do not ask process-related questions about how to apply and who to contact in the above ticket. The only questions allowed are about technical aspects of the project itself. If you want help with the process, you can refer to the instructions listed on Unstop, and any further queries can be taken up on our Discord channel titled DMP queries. Here's a video tutorial on how to submit a proposal for a project.

@Aryankb

Aryankb commented Apr 25, 2024

Hey @dennyabrain, I have some queries regarding the project:

  • What will be the length of the videos?
  • Is there any available dataset with pre-defined classes?
  • A video is a combination of audio, images, and text. What should be the most important classification criterion among these?
  • How many classes should there be for classification? Please give some examples.

@aatmanvaidya
Collaborator

> Hey @dennyabrain, I have some queries regarding the project:
>
>   • What will be the length of the videos?
>   • Is there any available dataset with pre-defined classes?
>   • A video is a combination of audio, images, and text. What should be the most important classification criterion among these?
>   • How many classes should there be for classification? Please give some examples.

Hi @Aryankb

  1. Generally, expect the length to be anywhere between 30 seconds and 20 minutes.
  2. Currently we don't have a dataset with pre-defined classes, but feel free to look for such datasets.
  3. To the best of my knowledge, a video is just a series of images, so to answer your question, the most important classification criterion would be images. Please investigate a bit deeper into this. Also take a look at the 3rd point in @dennyabrain's comment. That is an example of clustering images using a certain type of embedding.
  4. There is no specific number of classes, but think of classes as metadata for these videos in the context of social media. Some examples could be memes, political, health, paper documents, news etc. These are very broad labels; you can think of some specific ones too.

Hope this helps

@Mithilesh1609

Hey @dennyabrain, Mithilesh here. I have experience and a passion for creating end-to-end, highly scalable computer vision pipelines, and I am working with a young start-up as a machine learning engineer. I led a similar project implementation for one of the largest edTech companies in the world, where we worked on clustering a similar type of video (average length of 10 minutes) and then recommended videos based on user mistakes in tests; this involved embedding creation and efficient search algorithms. Apart from this, I led the creation and scaling of a computer-vision-based exam grading tool from 50 users to 4 million users with Docker and AWS, and brought down the running time by 70% over three iterations, which helped the government organize the world's largest AI-graded examination. I am very eager to contribute to this project and make clustering of videos more efficient and scalable.

@Aryankb

Aryankb commented May 1, 2024

Hey @dennyabrain, I am Aryan Kumar Baghel, from IIIT Naya Raipur.
I was exploring ways to extract unique frames from a video. I extracted unique keyframes from some videos using ffmpeg (to extract keyframes from the video) and k-means (to pick unique keyframes from the ones extracted by ffmpeg). Here are the results:

(We can select one image from each cluster as the representation of that cluster; then we can use image captioning models to generate small captions for each image. Next we can combine all captions to generate a final caption for the video, or use them to classify the video accurately.)

Google Collab Notebook

Video 1 link : https://drive.google.com/file/d/1Qr08m4Bf0JjTszExDLoey2LCqcJjJl3n/view?usp=drive_link
Clusters: (screenshots attached)

Video 2 link : https://drive.google.com/file/d/1QnupjsK7ILQUYrqlPT2pTdTAzoy8Wi-C/view?usp=drive_link
Clusters: (screenshots attached)

I'll now be working on ways to cluster the images such that the number of clusters is selected automatically. Please give your reviews and directions for future work.
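
One common way to pick the number of clusters automatically is to sweep k and keep the value that maximises the silhouette score. A minimal sketch with scikit-learn, where two synthetic blobs stand in for keyframe features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs standing in for keyframe features,
# so the silhouette sweep should recover k = 2
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(0.0, 0.1, size=(30, 8)),
    rng.normal(5.0, 0.1, size=(30, 8)),
])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
```

Alternatives that avoid the sweep entirely, such as Affinity Propagation, infer the cluster count directly, at a higher computational cost.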

@aaradhyasinghgaur

> (quoted @dennyabrain's earlier comment listing the three suggested approaches)

Hey @dennyabrain ,
I'm Aaradhya Singh, currently a 2nd-year undergrad in computer science and engineering, proficient in C/C++, Python, deep learning and machine learning, and a keen learner of various upcoming technologies and tech stacks. After reading your suggested approaches, I might be able to fine-tune some models, mostly built upon CNN/RNN architectures, and use a pipeline/hierarchical approach to solve the complex problem of classifying or clustering the content. Looking forward to working on it and updating you on my findings.

@dennyabrain
Contributor Author

@Snehil-Shah can you comment here, so I can assign the issue to you?

@Snehil-Shah
Contributor

@dennyabrain Yes.

@aatmanvaidya
Collaborator

aatmanvaidya commented Jun 17, 2024

Weekly Goals

Week 1

  • Set up my local development environment and workflow

Week 2

  • Create a mixed dataset of 150-200 videos
  • Run the Feluda video operator on a video dataset, reduce dimensions using t-SNE and do a visual plot - this will act as a baseline for us
  • Embedding models - CLIP, VideoMAE
  • Visually display the embeddings using t-SNE

Week 3

  • Try out more video models for embedding
  • Try self-supervised clustering algorithms
  • Do a literature review of video pre-processing. Read up on better sampling strategies.

Week 4

  • Run video models on the dataset and visualize using t-SNE; also do k-means clustering
  • Implement sampling strategies for video.

Week 5

  • Read and review sampling strategies
  • Explore - closer frames as input, higher-motion frames as input, etc.
  • How to better select n keyframes.

Week 6

  • Profile the CLIP embedding model.
  • Keep experimenting with more sampling strategies.
  • Collect a dataset of low-quality videos in Indian languages.

Week 7

  • Complete/build on the Indian videos dataset
  • Run the approach end-to-end on the new dataset, just clustering using CLIP
  • Run the zero-shot approach and cluster videos on the new dataset.
  • Profile CLIP on 20-min, 30-min and 1-hr videos.

Week 8

  • Run zero-shot on the custom dataset
  • Profile CLIP on 20-min, 30-min and 1-hr videos.
  • Write a basic video operator using CLIP.

Week 9

  • Run prod data through the zero-shot method, auto clustering, and play around with cluster numbers
  • Finish writing the CLIP operator and profile it for large videos
  • Write a test for the operator
  • Write an operator for the zero-shot method

Week 10

  • Run zero-shot on prod data
  • Write a new operator for clustering
  • Analyse zero-shot prod results

Week 11

  • Finish writing an operator for clustering
  • Write a test for the clustering operator
  • Run and analyse zero-shot prod results
  • Plan how to visualise clustering results

Week 12

  • Write a worker for clustering
  • Visualise clustering results with t-SNE
  • Document the CLIP operator

Week 13

  • Continue writing the worker
  • Document the CLIP operator
  • Document the clustering operator

@Snehil-Shah
Contributor

Snehil-Shah commented Jun 25, 2024

Weekly Learnings & Updates

Week 1

  • Set up my local development environment and workflow.
  • Set up a timeline and weekly check-ins with the mentors as part of the onboarding process.

Week 2

  • Benchmarked popular image embedding models to extract semantic features from video frames and average them into a video vector.
  • Compared pre-trained models like ResNet18, CLIP-ViT-B-32, EfficientNet-B0 and DeiT-medium-16.
  • Neurally encoded around 100 videos from a combined UCF101 subset and a custom dataset of popular topics like memes, nature, and commentary.
  • Plotted t-SNE reduced vectors to evaluate clustering and visual distribution.
  • Clustered them using k-means and examined each cluster to evaluate cluster quality and spot outliers.
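
The frame-to-video pooling described above reduces to averaging per-frame vectors. A NumPy sketch, where the random matrix stands in for real frame embeddings from a model like CLIP-ViT-B-32:

```python
import numpy as np

# Stand-in for per-frame image embeddings: 32 sampled frames x 512 dims
rng = np.random.default_rng(0)
frame_embeddings = rng.normal(size=(32, 512))

# Average the frame vectors into one video vector, then L2-normalize it
# so downstream cosine similarity is well behaved
video_vec = frame_embeddings.mean(axis=0)
video_vec /= np.linalg.norm(video_vec)
```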

Week 3

  • Reviewed literature and ran experiments on video transformers and 3D neural net architectures.
  • Benchmarked various video embedding models to extract active features and capture frame interpolation.
  • Compared pre-trained models like I3D-R50, R3D-18, SlowFast-R50, VideoMAE, ViViT and X-CLIP.
  • Ran inference using the above tests by plotting t-SNE reduced vectors and individual clusters.
  • Achieved near zero outliers and true action recognition that was missing with image embedding models.
  • Successfully clustered 100 videos into 13 classes each correctly corresponding to the original classes, all with just one outlier.
  • Finalized a hybrid approach simulating a multi-stream pipeline to capture the static and active aspects of a video.
  • Notebook containing all benchmarks, experiments, inferences, and opinions mentioned till now for reference.

Week 4

  • Explored and implemented zero-shot classification with promising results utilizing the multi-modal nature of CLIP-based transformer models.
    This meant being able to classify videos into newer classes without any fine-tuning or retraining of a new linear head. It works by relying on vector similarity between video embeddings and text embeddings (made from the class names) in a common vector space.
    This would allow fact-checkers and researchers to surface visual trends based on labels such as "newspaper", "screenshots", "memes" etc. without any additional training overhead.
  • Benchmarked various clustering algorithms from scikit-learn on efficiency and results by plotting individual clusters.
  • Finalized k-means and agglomerative clustering for when the number of clusters is known, and Affinity Propagation for an unknown number of clusters.
  • Notebook containing all benchmarks, experiments, inferences, and opinions on zero-shot classification and clustering algorithms.
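
The zero-shot mechanism described above reduces to cosine similarity between a video embedding and text embeddings of the class names in a shared space. A toy sketch with hand-made vectors (the identity-matrix "text embeddings" are purely illustrative; real ones would come from the CLIP text encoder):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins: one orthogonal "text embedding" per class name
labels = ["meme", "newspaper", "nature"]
text_embs = np.eye(3)
video_emb = np.array([0.1, 0.9, 0.2])  # most aligned with "newspaper"

# The predicted class is the label whose text embedding is most similar
scores = [cosine(video_emb, t) for t in text_embs]
predicted = labels[int(np.argmax(scores))]  # -> "newspaper"
```

Because no linear head is trained, swapping in a new label set is just a matter of re-encoding the class names.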

Week 5

  • Explored various frame sampling strategies for sampling both static and active aspects of a video.
  • Tried sampling static aspects using methods like QR decomposition, shot-transition detection, sampling I-frames using ffmpeg, and sampling cluster centroids by clustering frames.
  • Tried sampling active aspects using methods like simple RGB subtraction between near-adjacent frames, farneback's optical flow algorithm and improved it with background subtraction.
  • Extracting the most active parts of the video is tricky, as it can be susceptible to noise like frequent shot transitions and shaky camera work, but datasets like Kinetics (most video embedding models mentioned above are trained on it) contain short, simple action sequences with a static background and a clear subject. Basically, the model wouldn't be able to identify action from multi-angle action sequences as seen in movies.
    One solution can be using shot-transition detection to isolate each shot, and then individually measuring optical flow in each window.
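
The simplest of these activity signals, subtraction between near-adjacent frames, can be sketched in NumPy (the synthetic frames stand in for decoded grayscale video; the motion burst is injected artificially):

```python
import numpy as np

# Synthetic "video": 10 identical 8x8 grayscale frames, with extra
# noise injected into frames 4-6 to simulate a burst of motion
rng = np.random.default_rng(0)
frames = np.tile(rng.uniform(size=(1, 8, 8)), (10, 1, 1))
frames[4:7] += rng.uniform(size=(3, 8, 8))

# Mean absolute difference between adjacent frames as an activity score;
# static stretches score exactly zero, the motion burst scores high
activity = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))
most_active_transition = int(np.argmax(activity))
```

Optical-flow methods like Farneback's refine this by estimating per-pixel motion vectors instead of raw intensity change, which makes them less sensitive to global brightness shifts.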

Week 6

  • Built a custom dataset of ~120 videos depicting the Indian social media context with videos of varying lengths, qualities, and subjects meant to capture the diversity of media the operator can expect when deployed for downstream tasks like fact-checking and finding visual trends on social media research in India.
    This will allow us to run inference on production expectations and further tune the pipeline.
  • Profiled and benchmarked CLIP-ViT-B-32 and ResNet18 for CPU and memory usage using memray and pyinstrument to estimate deployment requirements.

Week 7

  • Ran inference on our clustering and zero-shot classification pipelines using our custom dataset to gauge performance on our use case.
  • Achieved 76-82% accuracy with our zero-shot classifier. This accuracy is really good given there was no additional training done on our custom dataset (consisting of various Indian-context videos of varied lengths and qualities) for classes the model had never seen before.
  • Notebooks containing inferences on clustering and zero-shot classification using a custom dataset.

Week 8

  • Worked on a Feluda operator implementing the above pipeline and wrote tests for it, adhering to the Feluda operator interface.
  • Pull request for the same.

Week 9

  • Profiled the above operator for CPU and memory usage using memray and pyinstrument to estimate deployment requirements.
    Check out the full profiling findings and conclusions here.
  • Worked on a Feluda operator for video classification using a zero-shot approach, adhering to the Feluda operator interface.
  • Pull request for the same.

Week 10

  • On leave

Week 11

  • Worked on a simple Feluda operator for clustering embeddings from sources of various modalities and supporting multiple modes of operation, adhering to the Feluda operator interface.
  • Pull request for the same.

Week 12

  • Started work on the worker for clustering media items. Worked on setting up config files and Dockerfile for the Feluda worker with relevant operators and RabbitMQ queue configurations.
  • Documented the new operators.

Week 13

  • Completed the worker logic and payload writer.
  • Documented the worker.
  • Pull request for the worker.

9 participants