Search Results for author: Mike Zheng Shou

Found 19 papers, 12 papers with code

GEB+: A benchmark for generic event boundary captioning, grounding and text-based retrieval

no code implementations · 1 Apr 2022 · Yuxuan Wang, Difei Gao, Licheng Yu, Stan Weixian Lei, Matt Feiszli, Mike Zheng Shou

Cognitive science has shown that humans perceive videos in terms of events separated by state changes of dominant subjects.

Unified Transformer Tracker for Object Tracking

1 code implementation · 29 Mar 2022 · Fan Ma, Mike Zheng Shou, Linchao Zhu, Haoqi Fan, Yilei Xu, Yi Yang, Zhicheng Yan

Although UniTrack \cite{wang2021different} demonstrates that a shared appearance model with multiple heads can be used to tackle individual tracking tasks, it fails to exploit the large-scale tracking datasets for training and performs poorly on single object tracking.

Frame · Multiple Object Tracking

Revitalize Region Feature for Democratizing Video-Language Pre-training

2 code implementations · 15 Mar 2022 · Guanyu Cai, Yixiao Ge, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, XiaoHu Qie, Jianping Wu, Mike Zheng Shou

Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language tasks.

Question Answering · Text to Video Retrieval · +3

All in One: Exploring Unified Video-Language Pre-training

1 code implementation · 14 Mar 2022 · Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we introduce for the first time an end-to-end video-language model, namely the \textit{all-in-one Transformer}, that embeds raw video and textual signals into joint representations using a unified backbone architecture.

Frame · Language Modelling · +11
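The "unified backbone" idea in the snippet above can be sketched in a few lines: project each modality into one shared space, then process the joined token sequence with a single shared layer so video and text tokens interact jointly. The dimensions, weights, and the averaging-based mixing rule below are illustrative assumptions, not the paper's architecture.

```python
# Toy sketch of a shared video-language backbone (assumed, simplified).

def project(tokens, weight):
    """Linear projection of each token (list of floats) via a weight matrix."""
    return [[sum(t[i] * weight[i][j] for i in range(len(t)))
             for j in range(len(weight[0]))] for t in tokens]

def shared_backbone(tokens):
    """One shared 'mixing' step: each token is averaged with the sequence
    mean, so video and text tokens influence each other in one pass."""
    d = len(tokens[0])
    mean = [sum(t[j] for t in tokens) / len(tokens) for j in range(d)]
    return [[(t[j] + mean[j]) / 2 for j in range(d)] for t in tokens]

# Hypothetical inputs: 2 video patch tokens (3-dim) and 2 text tokens (2-dim).
video_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_tokens = [[1.0, 1.0], [0.0, 1.0]]

w_video = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 -> 2
w_text = [[1.0, 0.0], [0.0, 1.0]]                # 2 -> 2 (identity)

joint = project(video_tokens, w_video) + project(text_tokens, w_text)
fused = shared_backbone(joint)       # one sequence, one backbone
print(len(fused), len(fused[0]))     # 4 tokens in a shared 2-dim space
```

The point of the sketch is that both modalities pass through the same parameters after projection, which is what distinguishes a unified backbone from per-modality encoders.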

AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant

1 code implementation · 8 Mar 2022 · Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, Mike Zheng Shou

In this paper, we define a new task called Affordance-centric Question-driven Task Completion, where the AI assistant should learn from instructional videos and scripts to guide the user step-by-step.

Contrastive Learning of Semantic and Visual Representations for Text Tracking

no code implementations · 30 Dec 2021 · Zhuang Li, Weijia Wu, Mike Zheng Shou, Jiahong Li, Size Li, Zhongyuan Wang, Hong Zhou

Semantic representation is of great benefit to the video text tracking (VTT) task, which requires simultaneously classifying, detecting, and tracking texts in the video.

Contrastive Learning
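A minimal InfoNCE-style loss illustrates the generic contrastive objective such methods build on: pull each visual embedding toward its paired semantic embedding and away from the other pairs in the batch. This is a standard contrastive loss sketch, not the paper's exact formulation; the batch below is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce(visual, semantic, temperature=0.1):
    """Average -log softmax of each visual embedding matching its paired
    semantic embedding against all others in the batch."""
    losses = []
    for i, v in enumerate(visual):
        logits = [cosine(v, s) / temperature for s in semantic]
        m = max(logits)  # max-trick for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Hypothetical batch: matched pairs aligned, mismatched pairs orthogonal.
visual = [[1.0, 0.0], [0.0, 1.0]]
semantic = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(visual, semantic))  # near zero: pairs already aligned
```

Swapping the semantic embeddings so pairs no longer match drives the loss up, which is exactly the signal that trains the representations.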

Video-Text Pre-training with Learned Regions

no code implementations · 2 Dec 2021 · Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng Wang, Xudong Lin, Guanyu Cai, Jinhui Tang

Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information.

Frame · Representation Learning · +1

Object-aware Video-language Pre-training for Retrieval

1 code implementation · 1 Dec 2021 · Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we present Object-aware Transformers, an object-centric approach that extends video-language transformer to incorporate object representations.

Text Matching

AVA-AVD: Audio-visual Speaker Diarization in the Wild

1 code implementation · 29 Nov 2021 · Eric Zhongcong Xu, Zeyang Song, Chao Feng, Mang Ye, Mike Zheng Shou

To create a testbed that can effectively compare diarization methods on videos in the wild, we annotate the speaker diarization labels on the AVA movie dataset and create a new benchmark called AVA-AVD.

Speaker Diarization

MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video

1 code implementation · 24 Nov 2021 · David Junhao Zhang, Kunchang Li, Yunpeng Chen, Yali Wang, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou

Self-attention has become an integral component of recent network architectures, e.g., the Transformer, that dominate major image and video benchmarks.

Ranked #11 on Action Recognition on Something-Something V2 (using extra training data)

Action Recognition · Image Classification · +1
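The attention-free alternative the title refers to can be sketched with a generic MLP-Mixer-style token-mixing step: tokens exchange information through a fixed linear layer applied along the token axis instead of a data-dependent attention map. This illustrates the general MLP-like mixing idea only, not MorphMLP's specific scheme; the weights below are hypothetical.

```python
# Sketch of self-attention-free token mixing (MLP-Mixer-style, assumed).

def token_mix(tokens, weight):
    """Mix across tokens: output token t is a weighted sum of all input
    tokens, applied independently per channel."""
    n, d = len(tokens), len(tokens[0])
    return [[sum(weight[t][s] * tokens[s][c] for s in range(n))
             for c in range(d)] for t in range(n)]

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 tokens, 2 channels
# Hypothetical mixing weights: uniform averaging, for illustration.
w = [[1 / 3] * 3 for _ in range(3)]
mixed = token_mix(tokens, w)
print(mixed[0])  # ~[3.0, 4.0]: each token becomes the per-channel mean
```

Unlike attention, the mixing weights here are learned parameters fixed at inference, which is what removes the quadratic, input-dependent attention computation.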

Ego4D: Around the World in 3,000 Hours of Egocentric Video

no code implementations · 13 Oct 2021 · Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.


On Pursuit of Designing Multi-modal Transformer for Video Grounding

no code implementations · EMNLP 2021 · Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou

Almost all existing video grounding methods fall into two frameworks: 1) Top-down model: It predefines a set of segment candidates and then conducts segment classification and regression.
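The top-down framework described in the snippet can be sketched in a few lines: predefine segment candidates, score each against the query (classification), then refine the best one with predicted boundary offsets (regression). The candidates, scores, and offsets below are toy values, and the pipeline is a generic illustration rather than the paper's model.

```python
# Sketch of a top-down video grounding pipeline (toy values, assumed).

def top_down_ground(candidates, scores, offsets):
    """Pick the highest-scoring candidate segment and refine it by
    applying its predicted (start, end) offsets."""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    start, end = candidates[best]
    ds, de = offsets[best]
    return (start + ds, end + de)

# Hypothetical candidate segments (in seconds) for one query.
candidates = [(0.0, 5.0), (5.0, 10.0), (10.0, 15.0)]
scores = [0.1, 0.8, 0.3]                 # classification: match to query
offsets = [(0.0, 0.0), (0.4, -0.6), (0.0, 0.0)]  # regression refinement
print(top_down_ground(candidates, scores, offsets))  # best segment, refined
```

The complementary bottom-up framework the abstract alludes to would instead predict per-frame boundary probabilities without predefined candidates.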

Generic Event Boundary Detection: A Benchmark for Event Segmentation

2 code implementations · ICCV 2021 · Mike Zheng Shou, Stan Weixian Lei, Weiyao Wang, Deepti Ghadiyaram, Matt Feiszli

This paper presents a novel task together with a new benchmark for detecting generic, taxonomy-free event boundaries that segment a whole video into chunks.

Action Detection · Boundary Detection · +3
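A naive baseline makes the task concrete: mark a boundary wherever consecutive frame features change by more than a threshold. This is only an illustrative detector for the task definition, not the benchmark's method, and the per-frame features below are made up.

```python
# Naive event boundary detector: threshold on frame-to-frame feature change.

def detect_boundaries(features, threshold=0.5):
    """Return frame indices where the Euclidean distance between
    consecutive frame features exceeds the threshold."""
    boundaries = []
    for t in range(1, len(features)):
        change = sum((a - b) ** 2
                     for a, b in zip(features[t], features[t - 1])) ** 0.5
        if change > threshold:
            boundaries.append(t)
    return boundaries

# Hypothetical per-frame features: a stable state, then a sudden change.
frames = [[0.0, 0.0]] * 3 + [[1.0, 1.0]] * 3
print(detect_boundaries(frames))  # [3]: the single state change
```

The benchmark's difficulty lies in making such boundaries taxonomy-free, i.e., detecting generic state changes rather than instances of predefined action classes.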

Channel Augmented Joint Learning for Visible-Infrared Recognition

no code implementations · ICCV 2021 · Mang Ye, Weijian Ruan, Bo Du, Mike Zheng Shou

This paper introduces a powerful channel augmented joint learning strategy for the visible-infrared recognition problem.

Data Augmentation · Metric Learning
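One generic form of channel-level augmentation for the visible-infrared gap is to randomly replace all color channels of a visible image with one of its own channels, mimicking the single-channel appearance of infrared imagery. This sketch shows that generic idea under stated assumptions; it is not necessarily the paper's exact augmentation strategy, and the image is a toy stand-in.

```python
import random

# Sketch of channel exchange augmentation (generic idea, assumed).

def channel_exchange(image, rng):
    """image: list of 3 channels, each a flat list of pixel values.
    Pick one channel at random and repeat it across all three, producing
    a grayscale-like, infrared-style sample."""
    c = rng.randrange(3)
    return [list(image[c]) for _ in range(3)]

rng = random.Random(0)               # seeded for reproducibility
img = [[10, 20], [30, 40], [50, 60]]  # toy 3-channel, 2-pixel image
aug = channel_exchange(img, rng)
print(aug[0] == aug[1] == aug[2])    # True: all channels now identical
```

Training on such augmented samples alongside the originals encourages features that are robust to the modality discrepancy, which joint learning strategies then exploit.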

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

2 code implementations · CVPR 2021 · Junting Pan, Siyu Chen, Mike Zheng Shou, Yu Liu, Jing Shao, Hongsheng Li

We propose to explicitly model the Actor-Context-Actor Relation, which is the relation between two actors based on their interactions with the context.

Action Detection · Action Recognition · +2
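The core idea, relating two actors through their interactions with shared context rather than directly, can be sketched with dot products: compute each actor's affinity to every context feature, then measure how much the two affinity profiles overlap. The plain dot-product relation below is an illustrative stand-in for the paper's learned relation modules, and all features are hypothetical.

```python
# Sketch of an actor-context-actor relation (illustrative, assumed).

def actor_context_actor(actor_a, actor_b, context):
    """First-order: each actor's affinity to every context feature.
    Higher-order: the two actors related through context they share."""
    def affinities(actor):
        return [sum(x * y for x, y in zip(actor, c)) for c in context]
    aff_a, aff_b = affinities(actor_a), affinities(actor_b)
    # Overlap of the affinity profiles: high when both actors interact
    # with the same context features.
    return sum(a * b for a, b in zip(aff_a, aff_b))

context = [[1.0, 0.0], [0.0, 1.0]]    # two context features
actor1 = [1.0, 0.0]                   # interacts with context[0]
actor2 = [0.8, 0.2]                   # mostly context[0] as well
actor3 = [0.0, 1.0]                   # interacts with context[1] only
print(actor_context_actor(actor1, actor2, context))  # high: shared context
print(actor_context_actor(actor1, actor3, context))  # 0.0: disjoint context
```

A direct actor-actor similarity would miss this structure; routing the relation through context is what lets the model capture interactions mediated by objects or scene elements.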
