Data loaders for pretrain and downstream tasks (retrieval and caption).
For pretrain, you need to prepare 3 parts,
Download raw videos from the HowTo100M webpage and extract s3d (howto100m) features. You can refer to VideoFeatureExtractor.
Note: this file is different from HowTo100M_v1.csv as in README.txt
The csv format contains two columns. The first column is the video id, and the second is the feature file (sub-path of the npy, which will post append to --features_path
(refer to pretrain part in README) to find the npy file when reading).
video_id,feature_file
Z8xhli297v8,Z8xhli297v8.npy
...
video_id: used to match the caption or transcript
feature_file: used to find the feature file after joining with --features_path
This pickle file is generated from raw_caption.json in raw_caption.zip introduced in README.txt
The format of this file is:
{
'video_id 1':{
'start': array([0.08, 7.37, 15.05, ...], dtype=object),
'end': array([9.96, 16.98, 27.9, ...], dtype=object),
'text': array(['sentence 1 placehodolder',
'sentence 2 placehodolder',
'sentence 3 placehodolder', ...], dtype=object)
},
...
}
Keep the start
is a sorted array.
The s3d feature extraction is the same as HowTo100M introduced above.
This file is generated from youcookii_annotations_trainval.json
, which can be downloaded from official webpage.
The format of this file is (similar to caption.pickle
introduced above, but one more key transcript
. The transcript
needs to generated by extra ASR tool from speech.):
{
'video_id 1':{
'start': array([0.08, 7.37, 15.05, ...], dtype=object),
'end': array([9.96, 16.98, 27.9, ...], dtype=object),
'text': array(['sentence 1 placehodolder',
'sentence 2 placehodolder',
'sentence 3 placehodolder', ...], dtype=object)
'transcript': array(['transcript 1 placehodolder',
'transcript 2 placehodolder',
'transcript 3 placehodolder', ...], dtype=object)
},
...
}
If you want to test on retrieval or caption w/o transcript tasks, you can set transcript
with array(['NONE', 'NONE', 'NONE', ...], dtype=object)
.
video_id,feature_file
Z8xhli297v8,Z8xhli297v8
...
Note: The video_id and feature_file are the same for the consistency and our historical compatibility. We use feature_file to get the feature from feature pickle.
The s3d feature extraction is the same as HowTo100M introduced above. The data can be downloaded in: https://github.com/microsoft/UniVL/releases/download/v0/msrvtt.zip