no code implementations • 21 Jun 2023 • YuHan Shen, Linjie Yang, Longyin Wen, Haichao Yu, Ehsan Elhamifar, Heng Wang
Recent work in video captioning has focused on designing architectures that can consume both video and text modalities, and on pre-training with large-scale video datasets that include text transcripts, such as HowTo100M.
Automatic Speech Recognition (ASR) +2
1 code implementation • CVPR 2022 • YuHan Shen, Ehsan Elhamifar
To compute the SRE loss, we develop a flexible transcript prediction (FTP) method that uses the output of the action classifier to find both the length of the transcript and the sequence of actions occurring in an unlabeled video.
1 code implementation • CVPR 2021 • YuHan Shen, Lu Wang, Ehsan Elhamifar
We address the problem of unsupervised localization of key-steps and feature learning in instructional videos using both visual and language instructions.