
Merge pull request snap-research#63 from tsaishien-chen/main
Add additional annotations
AliaksandrSiarohin authored Oct 25, 2024
2 parents bbae2b1 + 292d4df commit 9afd40b
Showing 12 changed files with 80 additions and 27 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1 +1 @@
Copyright Snap Inc. 2024. This dataset is made available by Snap Inc. for informational purposes only. No license, whether implied or otherwise, is granted in or to such dataset (including any rights to copy, modify, publish, distribute and/or commercialize such dataset), unless you have entered into a separate agreement for such rights. Such dataset is provided as-is, without warranty of any kind, express or implied, including any warranties of merchantability, title, fitness for a particular purpose, non-infringement, or that such dataset is free of defects, errors or viruses. In no event will Snap Inc. be liable for any damages or losses of any kind arising from the dataset or your use thereof.
Copyright (c) 2024 Snap Inc. All rights reserved. This dataset and code is made available by Snap Inc. for non-commercial, research purposes only. Non-commercial means not primarily intended for or directed towards commercial advantage or monetary compensation. Research purposes mean solely for study, instruction, or non-commercial research, testing or validation. No commercial license, whether implied or otherwise, is granted in or to this dataset and code, unless you have entered into a separate agreement with Snap Inc. for such rights. This dataset and code is provided as-is, without warranty of any kind, express or implied, including any warranties of merchantability, title, fitness for a particular purpose, non-infringement, or that the code is free of defects, errors or viruses. In no event will Snap Inc. be liable for any damages or losses of any kind arising from this dataset and code or your use thereof. Any redistribution of this dataset and code must retain or reproduce the above copyright notice, conditions and disclaimer.
53 changes: 43 additions & 10 deletions README.md
@@ -28,7 +28,40 @@ This repository has three sections:
- [Splitting](./splitting) includes the code to split a long video into multiple semantics-consistent short clips.
- [Captioning](./captioning) includes the proposed video captioning model trained on Panda-70M.

## 🔥 Updates (Oct 2024)
To better support the training of video generation models, which benefit from *single-shot* videos with *meaningful motion* and *aesthetically pleasing scenes*, we introduce two additional annotations (a minimal usage sketch follows the examples below):

- **Desirability Filtering**: This annotation assesses whether a video is a suitable training sample. We categorize videos into six groups based on their characteristics: `desirable`, `0_low_desirable_score`, `1_still_foreground_image`, `2_tiny_camera_movement`, `3_screen_in_screen`, `4_computer_screen_recording`. The tables below show an example for each category along with its percentage of videos in the dataset.
- **Shot Boundary Detection**: This annotation provides a list of intervals representing continuous shots within a video (predicted by [TransNetV2](https://github.com/soCzech/TransNetV2)). If the length of the list is one, it indicates the video consists of a single continuous shot without any shot boundaries.

<table class="center">
<tr>
<td width=33.3% style="border: none"><img src="./assets/2VcOUDaJcnk.56.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/2qUj6j7zLOQ.41.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/SimjcPdKPkE.26.gif"></td>
</tr>
<tr style="text-align: center;">
<td width=33.3% style="border: none">desirable (80.5%)</td>
<td width=33.3% style="border: none">0_low_desirable_score (5.28%)</td>
<td width=33.3% style="border: none">1_still_foreground_image (6.82%)</td>
</tr>
</table>

<table class="center">
<tr>
<td width=33.3% style="border: none"><img src="./assets/eOfsBLszShI.16.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/3W9ck1YVx2I.15.gif"></td>
<td width=33.3% style="border: none"><img src="./assets/14gEWADjcOI.6.gif"></td>
</tr>
<tr>
<td width=33.3% style="border: none">2_tiny_camera_movement (1.20%)</td>
<td width=33.3% style="border: none">3_screen_in_screen (5.03%)</td>
<td width=33.3% style="border: none">4_computer_screen_recording (1.13%)</td>
</tr>
</table>
<sup>**We will remove video samples from our dataset / GitHub / project webpage / technical presentation upon request. Please contact tsaishienchen at gmail dot com to make such a request.</sup>
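
As a rough illustration of how these two annotations can be used together, here is a minimal sketch; the field names match the dataloading section below, while the sample record and helper function are hypothetical:

```python
# Minimal sketch (hypothetical record): keep a clip only if it is labeled
# "desirable" and contains a single continuous shot.
sample = {
    "desirable_filtering": "desirable",
    # Assumed format: one [start, end] interval per continuous shot.
    "shot_boundary_detection": [[0.0, 8.5]],
}

def keep_for_training(meta):
    # A one-element shot list means the clip has no shot boundaries.
    single_shot = len(meta["shot_boundary_detection"]) == 1
    return meta["desirable_filtering"] == "desirable" and single_shot

print(keep_for_training(sample))  # True
```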

## Dataset

### Collection Pipeline
<p align="center" width="100%">
<a target="_blank"><img src="assets/collection_pipeline.gif" style="width: 100%; min-width: 200px; display: block; margin: auto;"></a>
Expand All @@ -37,11 +70,11 @@ This repository have three sections:
### Download
| Split | Download | # Source Videos | # Samples | Video Duration | Storage Space|
|-----------------|----------|-----------------|-----------|----------------|--------------|
| Training (full) | [link](https://drive.google.com/file/d/1DeODUcdJCEfnTjJywM-ObmrlVg-wsvwz/view?usp=sharing) (2.01 GB) | 3,779,763 | 70,723,513 | 167 khrs | ~36 TB |
| Training (10M) | [link](https://drive.google.com/file/d/1Lrsb65HTJ2hS7Iuy6iPCmjoc3abbEcAX/view?usp=sharing) (381 MB) | 3,755,240 | 10,473,922 | 37.0 khrs | ~8.0 TB |
| Training (2M) | [link](https://drive.google.com/file/d/1jWTNGjb-hkKiPHXIbEA5CnFwjhA-Fq_Q/view?usp=sharing) (86.5 MB) | 800,000 | 2,400,000 | 7.56 khrs | ~1.6 TB |
| Validation | [link](https://drive.google.com/file/d/1cTCaC7oJ9ZMPSax6I4ZHvUT-lqxOktrX/view?usp=sharing) (803 KB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
| Testing | [link](https://drive.google.com/file/d/1ee227tHEO-DT8AkX7y2q6-bfAtUL-yMI/view?usp=sharing) (803 KB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
| Training (full) | [link](https://drive.google.com/file/d/1pbh8W3qgst9CD7nlPhsH9wmUSWjQlGdW/view?usp=sharing) (2.73 GB) | 3,779,763 | 70,723,513 | 167 khrs | ~36 TB |
| Training (10M) | [link](https://drive.google.com/file/d/1LLOFeYw9nZzjT5aA1Wj4oGi5yHUzwSk5/view?usp=sharing) (504 MB) | 3,755,240 | 10,473,922 | 37.0 khrs | ~8.0 TB |
| Training (2M) | [link](https://drive.google.com/file/d/1k7NzU6wVNZYl6NxOhLXE7Hz7OrpzNLgB/view?usp=sharing) (118 MB) | 800,000 | 2,400,000 | 7.56 khrs | ~1.6 TB |
| Validation | [link](https://drive.google.com/file/d/1uHR5iXS3Sftzw6AwEhyZ9RefipNzBAzt/view?usp=sharing) (1.2 MB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
| Testing | [link](https://drive.google.com/file/d/1BZ9L-157Au1TwmkwlJV8nZQvSRLIiFhq/view?usp=sharing) (1.2 MB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |

More details can be found in the [Dataset Dataloading](./dataset_dataloading) section.

@@ -106,11 +139,11 @@ Users must follow [the related license](https://raw.githubusercontent.com/micros
If you find this project useful for your research, please cite our paper. :blush:

```bibtex
@article{chen2024panda70m,
title = {Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers},
author = {Chen, Tsai-Shien and Siarohin, Aliaksandr and Menapace, Willi and Deyneka, Ekaterina and Chao, Hsiang-wei and Jeon, Byung Eun and Fang, Yuwei and Lee, Hsin-Ying and Ren, Jian and Yang, Ming-Hsuan and Tulyakov, Sergey},
journal = {arXiv preprint arXiv:2402.19479},
year = {2024}
@inproceedings{chen2024panda70m,
title = {Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers},
author = {Chen, Tsai-Shien and Siarohin, Aliaksandr and Menapace, Willi and Deyneka, Ekaterina and Chao, Hsiang-wei and Jeon, Byung Eun and Fang, Yuwei and Lee, Hsin-Ying and Ren, Jian and Yang, Ming-Hsuan and Tulyakov, Sergey},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2024}
}
```

Binary file added assets/14gEWADjcOI.6.gif
Binary file added assets/2VcOUDaJcnk.56.gif
Binary file added assets/2qUj6j7zLOQ.41.gif
Binary file added assets/3W9ck1YVx2I.15.gif
Binary file added assets/SimjcPdKPkE.26.gif
Binary file added assets/eOfsBLszShI.16.gif
19 changes: 12 additions & 7 deletions dataset_dataloading/README.md
@@ -6,11 +6,11 @@ The section includes the csv files listing the data samples in Panda-70M and the
## Data Splitting and Download Link
| Split | Download | # Source Videos | # Samples | Video Duration | Storage Space |
|-----------------|----------|-----------------|-----------|----------------|---------------|
| Training (full) | [link](https://drive.google.com/file/d/1DeODUcdJCEfnTjJywM-ObmrlVg-wsvwz/view?usp=sharing) (2.01 GB) | 3,779,763 | 70,723,513 | 167 khrs | ~36 TB |
| Training (10M) | [link](https://drive.google.com/file/d/1Lrsb65HTJ2hS7Iuy6iPCmjoc3abbEcAX/view?usp=sharing) (381 MB) | 3,755,240 | 10,473,922 | 37.0 khrs | ~8.0 TB |
| Training (2M) | [link](https://drive.google.com/file/d/1jWTNGjb-hkKiPHXIbEA5CnFwjhA-Fq_Q/view?usp=sharing) (86.5 MB) | 800,000 | 2,400,000 | 7.56 khrs | ~1.6 TB |
| Validation | [link](https://drive.google.com/file/d/1cTCaC7oJ9ZMPSax6I4ZHvUT-lqxOktrX/view?usp=sharing) (803 KB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
| Testing | [link](https://drive.google.com/file/d/1ee227tHEO-DT8AkX7y2q6-bfAtUL-yMI/view?usp=sharing) (803 KB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
| Training (full) | [link](https://drive.google.com/file/d/1pbh8W3qgst9CD7nlPhsH9wmUSWjQlGdW/view?usp=sharing) (2.73 GB) | 3,779,763 | 70,723,513 | 167 khrs | ~36 TB |
| Training (10M) | [link](https://drive.google.com/file/d/1LLOFeYw9nZzjT5aA1Wj4oGi5yHUzwSk5/view?usp=sharing) (504 MB) | 3,755,240 | 10,473,922 | 37.0 khrs | ~8.0 TB |
| Training (2M) | [link](https://drive.google.com/file/d/1k7NzU6wVNZYl6NxOhLXE7Hz7OrpzNLgB/view?usp=sharing) (118 MB) | 800,000 | 2,400,000 | 7.56 khrs | ~1.6 TB |
| Validation | [link](https://drive.google.com/file/d/1uHR5iXS3Sftzw6AwEhyZ9RefipNzBAzt/view?usp=sharing) (1.2 MB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
| Testing | [link](https://drive.google.com/file/d/1BZ9L-157Au1TwmkwlJV8nZQvSRLIiFhq/view?usp=sharing) (1.2 MB) | 2,000 | 6,000 | 18.5 hrs | ~4.0 GB |
- The validation and testing sets are collected from 2,000 source videos that do not appear in any training set, to avoid information leakage. For each source video, we randomly sample 3 clips.
- The training set (10M) is a high-quality subset of the training set (full). In this subset, we sample at most 3 clips per source video to increase diversity, and all video-caption matching scores are larger than 0.43 to guarantee better caption quality.
- The training set (2M) is randomly sampled from the training set (10M) and includes 3 clips per source video.
@@ -33,7 +33,7 @@ video2dataset --url_list="<csv_file>" \
--caption_col="caption" \
--clip_col="timestamp" \
--output_folder="<output_folder>" \
--save_additional_columns="[matching_score]" \
--save_additional_columns="[matching_score,desirable_filtering,shot_boundary_detection]" \
--config="video2dataset/video2dataset/configs/panda70m.yaml"
```
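
If you want to inspect the new columns before launching a full download, a minimal sketch is shown below; the CSV filename is a placeholder, and the annotation columns are assumed to be stringified Python lists (one entry per clip of a source video), as suggested by the dataloading code in this commit:

```python
import ast

import pandas as pd

# Placeholder path; point this at any of the CSV files linked above.
df = pd.read_csv("panda70m_training_2m.csv")

row = df.iloc[0]
# ast.literal_eval is a safer stand-in for eval() when parsing the
# stringified per-clip lists stored in these columns.
desirability = ast.literal_eval(row["desirable_filtering"])
shot_boundaries = ast.literal_eval(row["shot_boundary_detection"])
print(desirability[0], shot_boundaries[0])
```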
### Known Issues
@@ -91,7 +91,12 @@ output-folder
...
```
- Each data comes with 3 files: `.mp4` (video), `.txt` (caption), `.json` (meta information)
- Meta information includes matching score (confidence score of each video-caption pair), caption, video title / description / categories / subtitles, to name but a few.
- Meta information includes:
- Caption
- Matching score: confidence score of each video-caption pair
- **[🔥 New]** Desirability filtering: whether a video is a suitable training sample for a video generation model. There are six categories of filtering results: `desirable`, `0_low_desirable_score`, `1_still_foreground_image`, `2_tiny_camera_movement`, `3_screen_in_screen`, `4_computer_screen_recording`. Check [here](https://github.com/snap-research/Panda-70M?tab=readme-ov-file#-updates-oct-2024) for examples of each category.
- **[🔥 New]** Shot boundary detection: a list of intervals representing continuous shots within a video (predicted by [TransNetV2](https://github.com/soCzech/TransNetV2)). If the length of the list is one, it indicates the video consists of a single continuous shot without any shot boundaries.
- Other metadata: video title, description, categories, subtitles, to name but a few (a minimal reading sketch follows the notes below).
- **[Note 1]** The dataset is unshuffled, and clips from the same long video are stored in the same shard. Please shuffle them manually if needed.
- **[Note 2]** The videos are resized to 360 px in height. You can change `download_size` in the [config](./video2dataset/video2dataset/configs/panda70m.yaml) file to get different video resolutions.
- **[Note 3]** The videos are downloaded with audio by default. You can change `download_audio` in the [config](./video2dataset/video2dataset/configs/panda70m.yaml) file to turn off the audio and increase download speed.
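
As a rough sketch of reading these annotations back after download (the output layout follows the tree above; the shard and file patterns are placeholders):

```python
import ast
import glob
import json

# Placeholder pattern; each downloaded clip has a .json file next to its .mp4.
for path in glob.glob("output-folder/00000/*.json"):
    with open(path) as f:
        meta = json.load(f)
    label = meta.get("desirable_filtering")
    shots = meta.get("shot_boundary_detection")
    # The shot list may be serialized as a string (see the dataloading code
    # below); parse it back into a Python list if so.
    if isinstance(shots, str):
        shots = ast.literal_eval(shots)
    print(path, label, len(shots) if shots else 0)
```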
@@ -256,6 +256,10 @@ def data_generator():
meta["caption"] = text_caption
if "matching_score" in meta:
meta["matching_score"] = str(eval(meta["matching_score"])[i])
if "desirable_filtering" in meta:
meta["desirable_filtering"] = eval(meta["desirable_filtering"])[i]
if "shot_boundary_detection" in meta:
meta["shot_boundary_detection"] = str(eval(meta["shot_boundary_detection"])[i])

if self.config["storage"]["captions_are_subtitles"]:
text_caption = meta.get("clip_subtitles")[0]["lines"][0]
@@ -256,6 +256,10 @@ def data_generator():
meta["caption"] = text_caption
if "matching_score" in meta:
meta["matching_score"] = str(eval(meta["matching_score"])[i])
if "desirable_filtering" in meta:
meta["desirable_filtering"] = eval(meta["desirable_filtering"])[i]
if "shot_boundary_detection" in meta:
meta["shot_boundary_detection"] = str(eval(meta["shot_boundary_detection"])[i])

if self.config["storage"]["captions_are_subtitles"]:
text_caption = meta.get("clip_subtitles")[0]["lines"][0]
