NEW: How2 will be part of the IWSLT 2019 speech-to-text evaluation.
NEW: Unsupervised Object Recognition Grounding labels for the development and test sets are now available! Check here.
NEW: Please check out the How2 Challenge Workshop at ICML 2019.
This repository contains code and meta-data to (re-)create the How2 dataset as described in the following paper:
Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. How2: A large-scale dataset for multimodal language understanding. In Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL), Montreal, Canada, December 2018. Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/1811.00347.
Please cite the following paper in all academic work that uses this dataset:
@inproceedings{sanabria18how2,
title = {{How2:} A Large-scale Dataset For Multimodal Language Understanding},
author = {Sanabria, Ramon and Caglayan, Ozan and Palaskar, Shruti and Elliott, Desmond and Barrault, Lo\"ic and Specia, Lucia and Metze, Florian},
booktitle = {Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL)},
year = {2018},
organization={NeurIPS},
url = {http://arxiv.org/abs/1811.00347}
}
We also acknowledge earlier work on the same data (including the original data collection), and encourage you to do the same:
- Shoou-I Yu, Lu Jiang, and Alexander Hauptmann. Instructional videos for unsupervised harvesting and learning of action examples. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 2014. ACM. http://doi.acm.org/10.1145/2647868.2654997.
More papers can be found in the bibliography.
To subscribe to the How2 mailing list, click here.
The corpus consists of around 80,000 instructional videos (about 2,000 hours) with associated English subtitles and summaries. About 300 hours have also been translated into Portuguese using crowd-sourcing, and were used during the JSALT 2018 Workshop. We are working on releasing a larger version; please let us know if you are interested.
You can obtain the corpus in one of two ways: re-create it from scratch using the code and metadata in this repository, or download a pre-packaged version of all the necessary files by filling in a form.
To receive the Portuguese translations, please fill in this form and request the "translation package".
The form will ask you to:
- confirm that you understand that the download is a shortcut to obtaining the features only;
- select which data you want: video features, audio features, text features, or the Portuguese translations (one or more).
It will then give you instructions on how to obtain the desired data.
Proceed from here
The results in the dataset paper can be reproduced using nmtpytorch. We provide instructions and configuration files to reproduce three baselines: multimodal speech-to-text, multimodal machine translation, and multimodal summarization.
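As an illustration, a typical nmtpytorch run looks roughly like the sketch below; the configuration file name is only a placeholder, and the actual configuration files are provided with the baseline instructions.

# install nmtpytorch and its dependencies
pip install nmtpytorch
# train a baseline from one of the provided configuration files
# (the file name below is a placeholder)
nmtpy train -C how2-baseline.conf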
Here are instructions on how to score and submit results for our three challenge tasks. Report scores on the test set for each task.
Dependencies: sclite (part of the NIST SCTK scoring toolkit), used in the scoring command below.
You can find a hypothesis example corresponding to the system with 18.4 WER in Table 2 of Caglayan et al. (2019) in ./test/hyp.filtered.word.wer.r5f045.max150.dev5.beam10.sclite.
To score this hypothesis, execute the following command:
sclite -r ./eval/asr/dev5.filtered.en -h ./eval/asr/hyp.filtered.word.wer.r5f045.max150.dev5.beam10.sclite -i spu_id -f 0 -o sum stdout dtl pra | grep Sum/Avg | awk '{print $11}'
It should return 18.4, i.e. the overall WER (the Err column of sclite's Sum/Avg summary line).
We use the nmtpy-coco-metrics tool for this evaluation. Report BLEU scores for MT and ROUGE-L scores for summarization.
Installation steps:
- pip install nmtpytorch
This will install all packages needed for this evaluation. You can find a demo hypothesis file from the summarization models described in Table 2 of Libovický et al. (2018) in ./eval/summarization/demo.test.hypothesis.
To score this hypothesis, execute the following command:
nmtpy-coco-metrics ./eval/summarization/demo.test.hypothesis -r ./eval/summarization/demo.test.reference
It should return a ROUGE-L score of 31.2.
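The MT hypotheses are scored with the same tool; the file names below are placeholders for your own output and the provided reference translations.

# score MT output (BLEU, among other metrics) against the reference
# (both file names are placeholders)
nmtpy-coco-metrics your.test.hypothesis -r your.test.reference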
Please use the issue tracker (https://github.com/srvk/how2-dataset/issues) to ask questions and get clarification. You can also send email.
License information for every video can be found in the .info.json file that is downloaded with every video.
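For example, a quick way to inspect the recorded licenses across a set of downloaded videos is the sketch below, assuming the metadata files live in a videos/ directory (a placeholder path) and contain the "license" key written by youtube-dl:

# print the license field of every downloaded metadata file
# (videos/ is a placeholder for wherever your .info.json files are stored)
grep -H '"license"' videos/*.info.json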
At the time of release, all videos included in this dataset were being made available by the original content providers under the standard YouTube License.
Unless noted otherwise, we are providing the contents of this repository under the Creative Commons BY-SA 4.0 (Attribution-ShareAlike) License (for data-like content) and/or the BSD-2-Clause License (for software-type content).