Search Results for author: Mike Zheng Shou

Found 153 papers, 95 papers with code

MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation

no code implementations20 Apr 2025 Siyi Jiao, Wenzheng Zeng, Yerong Li, Huayu Zhang, Changxin Gao, Nong Sang, Mike Zheng Shou

In this work, we address this by introducing MP-Mat, a novel 3D-and-instance-aware matting framework with multiplane representation, where the multiplane concept is designed from two different perspectives: scene geometry level and instance level.

Image Matting

AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis

no code implementations27 Mar 2025 Zhiwei Yang, Chen Gao, Jing Liu, Peng Wu, Guansong Pang, Mike Zheng Shou

To bridge this gap and facilitate the practical deployment of LLM-based VAD, we introduce AssistPDA, the first online video anomaly surveillance assistant that unifies video anomaly prediction, detection, and analysis (VAPDA) within a single framework.

Anomaly Detection Anomaly Forecasting +2

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

1 code implementation25 Mar 2025 YuChao Gu, Weijia Mao, Mike Zheng Shou

Long-context autoregressive modeling has significantly advanced language generation, but video generation still struggles to fully utilize extended temporal contexts.

Text Generation Video Generation

Impossible Videos

no code implementations18 Mar 2025 Zechen Bai, Hai Ci, Mike Zheng Shou

In addition, a video benchmark is curated to assess Video-LLMs on their ability of understanding impossible videos, which particularly requires reasoning on temporal dynamics and world knowledge.

counterfactual Video Generation +2

VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

1 code implementation17 Mar 2025 Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou

Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence.

Grounded Video Question Answering Temporal Localization +2

Edit Transfer: Learning Image Editing via Vision In-Context Relations

no code implementations17 Mar 2025 Lan Chen, Qi Mao, YuChao Gu, Mike Zheng Shou

We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image.

In-Context Learning Relation +1

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

1 code implementation12 Mar 2025 Kevin Qinghong Lin, Mike Zheng Shou

Human daily activities can be concisely narrated as sequences of routine events (e. g., turning off an alarm) in video streams, forming an event vocabulary.

EgoSchema Retrieval +1

TPDiff: Temporal Pyramid Video Diffusion Model

no code implementations12 Mar 2025 Lingmin Ran, Mike Zheng Shou

Given the inter-frame redundancy in video modality, maintaining full frame rates in high-entropy stages is unnecessary.

Computational Efficiency model

In-Context Defense in Computer Agents: An Empirical Study

no code implementations12 Mar 2025 Pei Yang, Hai Ci, Mike Zheng Shou

Our findings highlight that (1) defensive reasoning must precede action planning for optimal performance, and (2) a minimal number of exemplars (fewer than three) is sufficient to induce an agent's defensive behavior.

In-Context Learning

Balanced Image Stylization with Style Matching Score

no code implementations10 Mar 2025 Yuxin Jiang, Liming Jiang, Shuai Yang, Jia-Wei Liu, Ivor Tsang, Mike Zheng Shou

We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models.

Image Stylization

Automated Movie Generation via Multi-Agent CoT Planning

1 code implementation10 Mar 2025 Weijia Wu, Zeyu Zhu, Mike Zheng Shou

Existing long-form video generation frameworks lack automated planning, requiring manual input for storylines, scenes, cinematography, and character interactions, resulting in high costs and inefficiencies.

Video Generation

DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles

1 code implementation5 Mar 2025 Rui Zhao, Weijia Mao, Mike Zheng Shou

For tasks involving new paired knowledge, such as specific identities, a combination of a small set of paired image-text examples and larger-scale unpaired data is sufficient for effective domain-oriented adaptation.

Domain Adaptation Image to text

Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

no code implementations3 Mar 2025 Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, Huan Ling

At the core of our approach is Difix, a single-step image diffusion model trained to enhance and remove artifacts in rendered novel views caused by underconstrained regions of the 3D representation.

3DGS 3D Reconstruction +2

InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

no code implementations20 Feb 2025 Henry Hengyuan Zhao, Wenqi Pei, Yifei Tao, Haiyang Mei, Mike Zheng Shou

Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, which is vital for developing general-purpose AI assistants.

PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data

1 code implementation20 Feb 2025 Shijie Huang, Yiren Song, Yuxuan Zhang, Hailong Guo, Xueyin Wang, Mike Zheng Shou, Jiaming Liu

Subsequently, we fine-tune this model with EditLoRA using a small, artist-curated dataset of before-and-after image pairs to capture distinct editing styles and techniques.

Style Transfer

PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning

no code implementations17 Feb 2025 Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, Jun Liu

These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models.

WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

1 code implementation12 Feb 2025 Henry Hengyuan Zhao, Difei Gao, Mike Zheng Shou

In addition, to address the challenges of dynamic GUI automation tasks, we propose GUI-Thinker, a holistic framework, leveraging a critique mechanism, that effectively manages the unpredictability and complexity of GUI interactions.

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

1 code implementation3 Feb 2025 Yiren Song, Danze Chen, Mike Zheng Shou

Generating cognitive-aligned layered SVGs remains challenging due to existing methods' tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies.

DiffSim: Taming Diffusion Models for Evaluating Visual Similarity

1 code implementation19 Dec 2024 Yiren Song, Xiaokang Liu, Mike Zheng Shou

Diffusion models have fundamentally transformed the field of generative models, making the assessment of similarity between customized model outputs and reference inputs critically important.

Contrastive Learning Denoising +3

Anti-Reference: Universal and Immediate Defense Against Reference-Based Generation

1 code implementation8 Dec 2024 Yiren Song, Shengtao Lou, Xiaokang Liu, Hai Ci, Pei Yang, Jiaming Liu, Mike Zheng Shou

Diffusion models have revolutionized generative modeling with their exceptional ability to produce high-fidelity images.

ROICtrl: Boosting Instance Control for Visual Generation

no code implementations27 Nov 2024 YuChao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou

Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances.

Attribute object-detection +1

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

1 code implementation26 Nov 2024 Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

In this work, we develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection to reduce computational costs by formulating screenshots as an UI connected graph, adaptively identifying their redundant relationship and serve as the criteria for token selection during self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming that flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale High-quality GUI Instruction-following Datasets by careful data curation and employing a resampling strategy to address significant data type imbalances.

Instruction Following Natural Language Visual Grounding +1

MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

1 code implementation22 Nov 2024 Weijia Wu, MingYu Liu, Zeyu Zhu, Xi Xia, Haoen Feng, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, Mike Zheng Shou

Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but primarily focus on short, single-scene videos.

Video Generation

FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data

1 code implementation22 Nov 2024 Binqian Xu, Xiangbo Shu, Haiyang Mei, GuoSen Xie, Basura Fernando, Mike Zheng Shou, Jinhui Tang

Fine-tuning MLLMs with Federated Learning (FL) allows for expanding the training data scope by including private data sources, thereby enhancing their practical applicability in privacy-sensitive domains.

Federated Learning

The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use

1 code implementation15 Nov 2024 Siyuan Hu, Mingyu Ouyang, Difei Gao, Mike Zheng Shou

The recently released model, Claude 3. 5 Computer Use, stands out as the first frontier AI model to offer computer use in public beta as a graphical user interface (GUI) agent.

Skinned Motion Retargeting with Dense Geometric Interaction Perception

1 code implementation28 Oct 2024 Zijie Ye, Jia-Wei Liu, Jia Jia, Shikun Sun, Mike Zheng Shou

Subsequently, we develop a novel spatio-temporal representation called the dense mesh interaction (DMI) field.

motion retargeting

ControLRM: Fast and Controllable 3D Generation via Large Reconstruction Model

no code implementations12 Oct 2024 Hongbin Xu, Weitao Chen, Zhipeng Zhou, Feng Xiao, Baigui Sun, Mike Zheng Shou, Wenxiong Kang

In the condition training branch, we lock the triplane decoder and reuses the deep and robust encoding layers pretrained with millions of 3D data in LRM.

3D Generation Decoder

EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models

1 code implementation9 Oct 2024 Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, YuChao Gu, Lingmin Ran, Xiang Wang, Zhangjie Wu, Junhao Zhang, Yingya Zhang, Mike Zheng Shou

Our experiments with extensive data indicate that the model trained on generated data of the advanced model can approximate its generation capability.

Text-to-Image Generation

Image Watermarks are Removable Using Controllable Regeneration from Clean Noise

1 code implementation7 Oct 2024 Yepeng Liu, Yiren Song, Hai Ci, Yu Zhang, Haofan Wang, Mike Zheng Shou, Yuheng Bu

This scheme adds varying numbers of noise steps to the latent representation of the watermarked image, followed by a controlled denoising process starting from this noisy latent representation.

Attribute Denoising

Unsupervised Prior Learning: Discovering Categorical Pose Priors from Videos

no code implementations4 Oct 2024 Ziyu Wang, Shuangpeng Han, Mike Zheng Shou, Mengmi Zhang

PPL uses a hierarchical memory to store compositional parts of prototypical poses, from which we distill a general pose prior.

Animal Pose Estimation Decision Making +1

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

1 code implementation29 Sep 2024 Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou

We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos.

All Image Segmentation +12

High Quality Human Image Animation using Regional Supervision and Motion Blur Condition

no code implementations29 Sep 2024 Zhongcong Xu, Chaoyue Song, Guoxian Song, Jianfeng Zhang, Jun Hao Liew, Hongyi Xu, You Xie, Linjie Luo, Guosheng Lin, Jiashi Feng, Mike Zheng Shou

Although generating reasonable results, existing methods often overlook the need for regional supervision in crucial areas such as the face and hands, and neglect the explicit modeling for motion blur, leading to unrealistic low-quality synthesis.

Human Animation Image Animation

DOTA: Distributional Test-Time Adaptation of Vision-Language Models

no code implementations28 Sep 2024 Zongbo Han, Jialong Yang, Junfan Li, QinGhua Hu, Qianli Xu, Mike Zheng Shou, Changqing Zhang

Instead of naively memorizing representative test samples, Dota continually estimates the distributions of test samples, allowing the model to continually adapt to the deployment environment.

Test-time Adaptation

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

no code implementations29 Aug 2024 Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou

Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.

Learning Video Context as Interleaved Multimodal Sequences

1 code implementation31 Jul 2024 Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou

In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts.

Language Modeling Language Modelling +7

GUI Action Narrator: Where and When Did That Action Take Place?

no code implementations19 Jun 2024 Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou

The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks.

Optical Character Recognition (OCR) Video Captioning

VideoGUI: A Benchmark for GUI Automation from Instructional Videos

no code implementations14 Jun 2024 Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks.

Video Editing

Steganalysis on Digital Watermarking: Is Your Defense Truly Impervious?

no code implementations13 Jun 2024 Pei Yang, Hai Ci, Yiren Song, Mike Zheng Shou

Digital watermarking techniques are crucial for copyright protection and source identification of images, especially in the era of generative AI models.

Steganalysis

ProcessPainter: Learn Painting Process from Sequence Data

1 code implementation10 Jun 2024 Yiren Song, Shijie Huang, Chen Yao, Xiaojun Ye, Hai Ci, Jiaming Liu, Yuxuan Zhang, Mike Zheng Shou

The painting process of artists is inherently stepwise and varies significantly among different painters and styles.

Denoising Image Generation

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

1 code implementation4 Jun 2024 Alex Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, Mike Zheng Shou

Training models with longer in-context lengths is a significant challenge for multimodal model due to substantial GPU memory and computational costs.

document understanding Retrieval

Multi-Modal Generative Embedding Model

no code implementations29 May 2024 Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun

Existing models usually tackle these two types of problems by decoupling language modules into a text decoder for generation, and a text encoder for embedding.

Caption Generation Cross-Modal Retrieval +9

LOVA3: Learning to Visual Question Answering, Asking and Assessment

1 code implementation23 May 2024 Henry Hengyuan Zhao, Pan Zhou, Difei Gao, Zechen Bai, Mike Zheng Shou

Our results demonstrate consistent performance gains, underscoring the critical role of these additional tasks in fostering comprehensive intelligence in MLLMs.

Question Answering Visual Question Answering

Hallucination of Multimodal Large Language Models: A Survey

1 code implementation29 Apr 2024 Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, Mike Zheng Shou

By drawing the granular classification and landscapes of hallucination causes, evaluation benchmarks, and mitigation methods, this survey aims to deepen the understanding of hallucinations in MLLMs and inspire further advancements in the field.

Hallucination Survey

Learning Long-form Video Prior via Generative Pre-Training

1 code implementation24 Apr 2024 Jinheng Xie, Jiajun Feng, Zhaoxu Tian, Kevin Qinghong Lin, Yawen Huang, Xi Xia, Nanxu Gong, Xu Zuo, Jiaqi Yang, Yefeng Zheng, Mike Zheng Shou

Instead of operating on pixel space, it is efficient to employ visual locations like bounding boxes and keypoints to represent key information in videos, which can be simply discretized and then tokenized for consumption by GPT.

Form

RingID: Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification

1 code implementation22 Apr 2024 Hai Ci, Pei Yang, Yiren Song, Mike Zheng Shou

We revisit Tree-Ring Watermarking, a recent diffusion model watermarking method that demonstrates great robustness to various attacks.

DragAnything: Motion Control for Anything using Entity Representation

2 code implementations12 Mar 2024 Weijia Wu, Zhuang Li, YuChao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, Di Zhang

We introduce DragAnything, which utilizes a entity representation to achieve motion control for any object in controllable video generation.

Object Video Generation

Bring Your Own Character: A Holistic Solution for Automatic Facial Animation Generation of Customized Characters

1 code implementation21 Feb 2024 Zechen Bai, Peng Chen, Xiaolan Peng, Lu Liu, Hui Chen, Mike Zheng Shou, Feng Tian

In our solution, a deep learning model was first trained to retarget the facial expression from input face images to virtual human faces by estimating the blendshape coefficients.

Unity

Skip \n: A Simple Method to Reduce Hallucination in Large Vision-Language Models

2 code implementations2 Feb 2024 Zongbo Han, Zechen Bai, Haiyang Mei, Qianli Xu, Changqing Zhang, Mike Zheng Shou

Recent advancements in large vision-language models (LVLMs) have demonstrated impressive capability in visual information understanding with human language.

Hallucination

Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

2 code implementations3 Jan 2024 David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo

This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text.

Image Animation Video Editing +1

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream

no code implementations CVPR 2024 Jingtao Sun, Yaonan Wang, Mingtao Feng, Yulan Guo, Ajmal Mian, Mike Zheng Shou

To this end we present a generic Language-to-4D modeling paradigm termed L4D-Track that tackles zero-shot 6-DoF \underline Track ing and shape reconstruction by learning pairwise implicit 3D representation and multi-level multi-modal alignment.

3D Shape Reconstruction Pose Tracking

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

no code implementations1 Jan 2024 Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, JianFeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

\ModelName, our unified framework, merges unimodal and multimodal elements, enhancing model performance for tasks involving textual and visual data while notably reducing learnable parameters.

Language Modelling Reading Comprehension +1

Parrot Captions Teach CLIP to Spot Text

1 code implementation21 Dec 2023 Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou

Despite CLIP being the foundation model in numerous vision-language applications, the CLIP suffers from a severe text spotting bias.

Representation Learning text similarity +1

ShowRoom3D: Text to High-Quality 3D Room Generation Using 3D Priors

no code implementations20 Dec 2023 Weijia Mao, Yan-Pei Cao, Jia-Wei Liu, Zhongcong Xu, Mike Zheng Shou

Previous methods using 2D diffusion priors to optimize neural radiance fields for generating room-scale scenes have shown unsatisfactory quality.

NeRF

ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation

1 code implementation20 Dec 2023 Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou

Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks, thereby boosting human productivity.

Language Modelling Large Language Model

MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance

no code implementations18 Dec 2023 Qi Mao, Lan Chen, YuChao Gu, Zhen Fang, Mike Zheng Shou

Recent diffusion-based image editing approaches have exhibited impressive editing capabilities in images with simple compositions.

Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

1 code implementation11 Dec 2023 Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou

Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-solving capabilities, but few research studies aim to gauge the ability to generate visual instruction tuning data.

Image Captioning Question Answering +1

Bootstrapping SparseFormers from Vision Foundation Models

1 code implementation CVPR 2024 Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou

In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way.

ColonNeRF: High-Fidelity Neural Reconstruction of Long Colonoscopy

no code implementations4 Dec 2023 Yufei Shi, Beijia Lu, Jia-Wei Liu, Ming Li, Mike Zheng Shou

Specifically, to reconstruct the entire colon in a piecewise manner, our ColonNeRF introduces a region division and integration module, effectively reducing shape dissimilarity and ensuring geometric consistency in each segment.

NeRF Neural Rendering +1

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

no code implementations CVPR 2024 YuChao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Tang

In contrast to previous methods that rely on dense correspondences, we introduce the VideoSwap framework that exploits semantic point correspondences, inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape.

Video Editing

DeformGS: Scene Flow in Highly Deformable Scenes for Deformable Object Manipulation

no code implementations30 Nov 2023 Bardienus P. Duisterhof, Zhao Mandi, Yunchao Yao, Jia-Wei Liu, Jenny Seidenschwarz, Mike Zheng Shou, Deva Ramanan, Shuran Song, Stan Birchfield, Bowen Wen, Jeffrey Ichnowski

Just as object pose estimation and tracking have aided robots for rigid manipulation, dense 3D tracking (scene flow) of highly deformable objects will enable new applications in robotics while aiding existing approaches, such as imitation learning or creating digital twins with real2sim transfer.

Deformable Object Manipulation Imitation Learning +2

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

2 code implementations CVPR 2024 Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.

Video Understanding

MLLMs-Augmented Visual-Language Representation Learning

1 code implementation30 Nov 2023 Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You

Visual-language pre-training has achieved remarkable success in many multi-modal tasks, largely attributed to the availability of large-scale image-text datasets.

Image-text Retrieval Representation Learning +1

Continual Learning for Image Segmentation with Dynamic Query

1 code implementation29 Nov 2023 Weijia Wu, Yuzhong Zhao, Zhuang Li, Lianlei Shan, Hong Zhou, Mike Zheng Shou

Image segmentation based on continual learning exhibits a critical drop of performance, mainly due to catastrophic forgetting and background shift, as they are required to incorporate new classes continually.

Continual Learning Diversity +6

ViT-Lens: Towards Omni-modal Representations

1 code implementation CVPR 2024 Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou

In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space.

EEG Image Generation +2

Paragraph-to-Image Generation with Information-Enriched Diffusion Model

1 code implementation24 Nov 2023 Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang

In this paper, we introduce an information-enriched diffusion model for paragraph-to-image generation task, termed ParaDiffusion, which delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation.

Image Generation Language Modeling +2

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

no code implementations CVPR 2024 Jia-Wei Liu, Yan-Pei Cao, Jay Zhangjie Wu, Weijia Mao, YuChao Gu, Rui Zhao, Jussi Keppo, Ying Shan, Mike Zheng Shou

To overcome this, we propose to introduce the dynamic Neural Radiance Fields (NeRF) as the innovative video representation, where the editing can be performed in the 3D spaces and propagated to the entire video via the deformation field.

NeRF Style Transfer +2

MotionDirector: Motion Customization of Text-to-Video Diffusion Models

1 code implementation12 Oct 2023 Rui Zhao, YuChao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jiawei Liu, Weijia Wu, Jussi Keppo, Mike Zheng Shou

Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion.

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

1 code implementation27 Sep 2023 David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, YuChao Gu, Difei Gao, Mike Zheng Shou

In this paper, we are the first to propose a hybrid model, dubbed as Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation.

Text-to-Video Generation Video Alignment +1

Unsupervised Open-Vocabulary Object Localization in Videos

1 code implementation ICCV 2023 Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng Zhang, Yanwei Fu, Tong He

In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization.

Object Object Localization +1

Spiking-LEAF: A Learnable Auditory front-end for Spiking Neural Networks

no code implementations18 Sep 2023 Zeyang Song, Jibin Wu, Malu Zhang, Mike Zheng Shou, Haizhou Li

Brain-inspired spiking neural networks (SNNs) have demonstrated great potential for temporal signal processing.

Keyword Spotting Speaker Identification

SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels

2 code implementations15 Sep 2023 Henry Hengyuan Zhao, Pichao Wang, Yuyang Zhao, Hao Luo, Fan Wang, Mike Zheng Shou

Experiments on 19 visual transfer learning downstream tasks demonstrate that our SCT outperforms full fine-tuning on 18 out of 19 tasks by adding only 0. 11M parameters of the ViT-B, which is 780$\times$ fewer than its full fine-tuning counterpart.

Domain Generalization Few-Shot Learning +2

Dataset Condensation via Generative Model

no code implementations14 Sep 2023 David Junhao Zhang, Heng Wang, Chuhui Xue, Rui Yan, Wenqing Zhang, Song Bai, Mike Zheng Shou

Dataset condensation aims to condense a large dataset with a lot of training samples into a small set.

Dataset Condensation model

ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

1 code implementation20 Aug 2023 Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou

A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities.

3D Classification Question Answering +4

Recap: Detecting Deepfake Video with Unpredictable Tampered Traces via Recovering Faces and Mapping Recovered Faces

no code implementations19 Aug 2023 Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou

In the recovering stage, the model focuses on randomly masking regions of interest (ROIs) and reconstructing real faces without unpredictable tampered traces, resulting in a relatively good recovery effect for real faces while a poor recovery effect for fake faces.

DeepFake Detection Face Swapping

Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks

no code implementations13 Aug 2023 David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang Han, Song Bai, Mike Zheng Shou

Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection, and pose additional challenges due to concerns regarding data privacy.

Contrastive Learning Image Classification +2

Revisiting Vision Transformer from the View of Path Ensemble

no code implementations ICCV 2023 Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou

Therefore, we propose the path pruning and EnsembleScale skills for improvement, which cut out the underperforming paths and re-weight the ensemble components, respectively, to optimize the path combination and make the short paths focus on providing high-quality representation for subsequent paths.

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

1 code implementation NeurIPS 2023 Weijia Wu, Yuzhong Zhao, Hao Chen, YuChao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen

To showcase the power of the proposed approach, we generate datasets with rich dense pixel-wise labels for a wide range of downstream tasks, including semantic segmentation, instance segmentation, and depth estimation.

Dataset Generation Decoder +7

UniVTG: Towards Unified Video-Language Temporal Grounding

1 code implementation ICCV 2023 Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

Most methods in this direction develop taskspecific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their abilities to generalize to various VTG tasks and labels.

Highlight Detection Moment Retrieval +3

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

2 code implementations ICCV 2023 Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, Mike Zheng Shou

As such paired data is time-consuming and labor-intensive to acquire and restricted to a closed set, this potentially becomes the bottleneck for applications in an open world.

Conditional Text-to-Image Synthesis Denoising

GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

1 code implementation27 Jun 2023 Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou

Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and further fine-tune the model on annotated data.

Natural Language Queries

TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter

no code implementations22 Jun 2023 Binjie Zhang, Yixiao Ge, Xuyuan Xu, Ying Shan, Mike Zheng Shou

In situations involving system upgrades that require updating the upstream foundation model, it becomes essential to re-train all downstream modules to adapt to the new foundation model, which is inflexible and inefficient.

Question Answering Text Retrieval +4

Too Large; Data Reduction for Vision-Language Pre-Training

2 code implementations ICCV 2023 Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou

Specifically, TL;DR can compress the mainstream VLP datasets at a high ratio, e. g., reduce well-cleaned CC3M dataset from 2. 82M to 0. 67M ($\sim$24\%) and noisy YFCC15M from 15M to 2. 5M ($\sim$16. 7\%).

Decoder

VisorGPT: Learning Visual Prior via Generative Pre-Training

1 code implementation23 May 2023 Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin Qinghong Lin, Yefeng Zheng, Linlin Shen, Mike Zheng Shou

Experimental results demonstrate that VisorGPT can effectively model the visual prior, which can be employed for many vision tasks, such as customizing accurate human pose for conditional image synthesis models like ControlNet.

Image Generation Language Modeling +2

A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension

1 code implementation5 May 2023 Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Hong Zhou, Mike Zheng Shou, Xiang Bai

Most existing cross-modal language-to-video retrieval (VR) research focuses on single-modal input from video, i. e., visual representation, while the text is omnipresent in human environments and frequently critical to understand video.

Reading Comprehension Retrieval +2

Open-World Weakly-Supervised Object Localization

1 code implementation17 Apr 2023 Jinheng Xie, Zhaochuan Luo, Yuexiang Li, Haozhe Liu, Linlin Shen, Mike Zheng Shou

To handle such data, we propose a novel paradigm of contrastive representation co-learning using both labeled and unlabeled data to generate a complete G-CAM (Generalized Class Activation Map) for object localization, without the requirement of bounding box annotation.

Object Representation Learning +1

ICDAR 2023 Video Text Reading Competition for Dense and Small Text

no code implementations10 Apr 2023 Weijia Wu, Yuzhong Zhao, Zhuang Li, Jiahong Li, Mike Zheng Shou, Umapada Pal, Dimosthenis Karatzas, Xiang Bai

In this competition report, we establish a video text reading benchmark, DSText, which focuses on dense and small text reading challenges in the video with various scenarios.

Task 2 Text Detection +2

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens

1 code implementation7 Apr 2023 Ziteng Gao, Zhan Tong, LiMin Wang, Mike Zheng Shou

In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human's sparse visual recognition in an end-to-end manner.

Sparse Representation-based Classification Video Classification

Making Vision Transformers Efficient from A Token Sparsification View

1 code implementation CVPR 2023 Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin, Mike Zheng Shou

In this work, we propose a novel Semantic Token ViT (STViT), for efficient global and local vision transformers, which can also be revised to serve as backbone for downstream tasks.

Efficient ViTs Instance Segmentation +4

Revisit Parameter-Efficient Transfer Learning: A Two-Stage Paradigm

no code implementations14 Mar 2023 Hengyuan Zhao, Hao Luo, Yuyang Zhao, Pichao Wang, Fan Wang, Mike Zheng Shou

In view of the practicality of PETL, previous works focus on tuning a small set of parameters for each downstream task in an end-to-end manner while rarely considering the task distribution shift issue between the pre-training task and the downstream task.

Transfer Learning Vocal Bursts Valence Prediction

Mover: Mask and Recovery based Facial Part Consistency Aware Method for Deepfake Video Detection

no code implementations3 Mar 2023 Juan Hu, Xin Liao, Difei Gao, Satoshi Tsutsui, Qian Wang, Zheng Qin, Mike Zheng Shou

Specifically, given a real face image, we first pretrain a masked autoencoder to learn facial part consistency by dividing faces into three parts and randomly masking ROIs, which are then recovered based on the unmasked facial parts.

DeepFake Detection Face Swapping

Object-centric Learning with Cyclic Walks between Parts and Whole

1 code implementation NeurIPS 2023 Ziyu Wang, Mike Zheng Shou, Mengmi Zhang

To capture compositional entities of the scene, we proposed cyclic walks between perceptual features extracted from vision transformers and object entities.

Decoder Object

STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition

no code implementations ICCV 2023 Ming Li, Xiangyu Xu, Hehe Fan, Pan Zhou, Jun Liu, Jia-Wei Liu, Jiahe Li, Jussi Keppo, Mike Zheng Shou, Shuicheng Yan

For the first time, we introduce vision Transformers into PPAR by treating a video as a tubelet sequence, and accordingly design two complementary mechanisms, i. e., sparsification and anonymization, to remove privacy from a spatio-temporal perspective.

Action Recognition Facial Expression Recognition (FER) +2

Position-guided Text Prompt for Vision-Language Pre-training

1 code implementation CVPR 2023 Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan

In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP.

Cross-Modal Retrieval Image Captioning +6

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

1 code implementation CVPR 2023 Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou

To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must.

Form Question Answering +3

PV3D: A 3D Generative Model for Portrait Video Generation

no code implementations13 Dec 2022 Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Wenqing Zhang, Song Bai, Jiashi Feng, Mike Zheng Shou

While some prior works have applied such image GANs to unconditional 2D portrait video generation and static 3D portrait synthesis, there are few works successfully extending GANs for generating 3D-aware portrait videos.

Video Generation

Darwinian Model Upgrades: Model Evolving with Selective Compatibility

no code implementations13 Oct 2022 Binjie Zhang, Shupeng Su, Yixiao Ge, Xuyuan Xu, Yexin Wang, Chun Yuan, Mike Zheng Shou, Ying Shan

The traditional model upgrading paradigm for retrieval requires recomputing all gallery embeddings before deploying the new model (dubbed as "backfilling"), which is quite expensive and time-consuming considering billions of instances in industrial applications.

Face Recognition model +1

Single-Stage Open-world Instance Segmentation with Cross-task Consistency Regularization

2 code implementations18 Aug 2022 Xizhe Xue, Dongdong Yu, Lingqiao Liu, Yu Liu, Satoshi Tsutsui, Ying Li, Zehuan Yuan, Ping Song, Mike Zheng Shou

Based on the single-stage instance segmentation framework, we propose a regularization model to predict foreground pixels and use its relation to instance segmentation to construct a cross-task consistency loss.

Autonomous Driving Object +3

Egocentric Video-Language Pretraining @ Ego4D Challenge 2022

1 code implementation4 Jul 2022 Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, RongCheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou

In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for four Ego4D challenge tasks, including Natural Language Query (NLQ), Moment Query (MQ), Object State Change Classification (OSCC), and PNR Localization (PNR).

Language Modeling Language Modelling +1

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

1 code implementation CVPR 2023 Xudong Lin, Simran Tiwari, Shiyuan Huang, Manling Li, Mike Zheng Shou, Heng Ji, Shih-Fu Chang

We surprisingly find that discrete text tokens coupled with a pretrained contrastive text model yields the best performance, which can even outperform state-of-the-art on the iVQA and How2QA datasets without additional training on millions of video-text data.

Retrieval Sentence +2

Label-Efficient Online Continual Object Detection in Streaming Video

1 code implementation ICCV 2023 Jay Zhangjie Wu, David Junhao Zhang, Wynne Hsu, Mengmi Zhang, Mike Zheng Shou

Remarkably, with only 25% annotated video frames, our method still outperforms the base CL learners, which are trained with 100% annotations on all video frames.

Continual Learning Hippocampus +3

Novel View Synthesis for High-fidelity Headshot Scenes

1 code implementation31 May 2022 Satoshi Tsutsui, Weijia Mao, Sijing Lin, Yunyi Zhu, Murong Ma, Mike Zheng Shou

Based on these observations, we propose a method to use both NeRF and 3DMM to synthesize a high-fidelity novel view of a scene with a face.

Generative Adversarial Network NeRF +2

Unified Transformer Tracker for Object Tracking

1 code implementation CVPR 2022 Fan Ma, Mike Zheng Shou, Linchao Zhu, Haoqi Fan, Yilei Xu, Yi Yang, Zhicheng Yan

Although UniTrack \cite{wang2021different} demonstrates that a shared appearance model with multiple heads can be used to tackle individual tracking tasks, it fails to exploit the large-scale tracking datasets for training and performs poorly on single object tracking.

Multiple Object Tracking Object

Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval

2 code implementations15 Mar 2022 Guanyu Cai, Yixiao Ge, Binjie Zhang, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, XiaoHu Qie, Jianping Wu, Mike Zheng Shou

Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval.

Question Answering Retrieval +4

All in One: Exploring Unified Video-Language Pre-training

1 code implementation CVPR 2023 Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we for the first time introduce an end-to-end video-language model, namely \textit{all-in-one Transformer}, that embeds raw video and textual signals into joint representations using a unified backbone architecture.

Ranked #6 on TGIF-Transition on TGIF-QA (using extra training data)

All Language Modelling +11

AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant

4 code implementations8 Mar 2022 Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, Mike Zheng Shou

In this paper, we define a new task called Affordance-centric Question-driven Task Completion, where the AI assistant should learn from instructional videos to provide step-by-step help in the user's view.

Visual Question Answering (VQA)

Contrastive Learning of Semantic and Visual Representations for Text Tracking

1 code implementation30 Dec 2021 Zhuang Li, Weijia Wu, Mike Zheng Shou, Jiahong Li, Size Li, Zhongyuan Wang, Hong Zhou

Semantic representation is of great benefit to the video text tracking(VTT) task that requires simultaneously classifying, detecting, and tracking texts in the video.

Contrastive Learning

Video-Text Pre-training with Learned Regions

1 code implementation2 Dec 2021 Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng Wang, Xudong Lin, Guanyu Cai, Jinhui Tang

Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information.

Representation Learning Text Retrieval +1

Object-aware Video-language Pre-training for Retrieval

1 code implementation CVPR 2022 Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we present Object-aware Transformers, an object-centric approach that extends video-language transformer to incorporate object representations.

Object Retrieval +2

AVA-AVD: Audio-Visual Speaker Diarization in the Wild

9 code implementations29 Nov 2021 Eric Zhongcong Xu, Zeyang Song, Satoshi Tsutsui, Chao Feng, Mang Ye, Mike Zheng Shou

Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory and visual signals.

Relation Network speaker-diarization +1

Ego4D: Around the World in 3,000 Hours of Egocentric Video

8 code implementations CVPR 2022 Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Cartillier, Sean Crane, Tien Do, Morrie Doulaty, Akshay Erapalli, Christoph Feichtenhofer, Adriano Fragomeni, Qichen Fu, Abrham Gebreselasie, Cristina Gonzalez, James Hillis, Xuhua Huang, Yifei HUANG, Wenqi Jia, Weslie Khoo, Jachym Kolar, Satwik Kottur, Anurag Kumar, Federico Landini, Chao Li, Yanghao Li, Zhenqiang Li, Karttikeya Mangalam, Raghava Modhugu, Jonathan Munro, Tullie Murrell, Takumi Nishiyasu, Will Price, Paola Ruiz Puentes, Merey Ramazanova, Leda Sari, Kiran Somasundaram, Audrey Southerland, Yusuke Sugano, Ruijie Tao, Minh Vo, Yuchen Wang, Xindi Wu, Takuma Yagi, Ziwei Zhao, Yunyi Zhu, Pablo Arbelaez, David Crandall, Dima Damen, Giovanni Maria Farinella, Christian Fuegen, Bernard Ghanem, Vamsi Krishna Ithapu, C. V. Jawahar, Hanbyul Joo, Kris Kitani, Haizhou Li, Richard Newcombe, Aude Oliva, Hyun Soo Park, James M. Rehg, Yoichi Sato, Jianbo Shi, Mike Zheng Shou, Antonio Torralba, Lorenzo Torresani, Mingfei Yan, Jitendra Malik

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite.

De-identification Ethics

On Pursuit of Designing Multi-modal Transformer for Video Grounding

no code implementations EMNLP 2021 Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou

Almost all existing video grounding methods fall into two frameworks: 1) Top-down model: It predefines a set of segment candidates and then conducts segment classification and regression.

All Decoder +2

Generic Event Boundary Detection: A Benchmark for Event Segmentation

2 code implementations ICCV 2021 Mike Zheng Shou, Stan Weixian Lei, Weiyao Wang, Deepti Ghadiyaram, Matt Feiszli

This paper presents a novel task together with a new benchmark for detecting generic, taxonomy-free event boundaries that segment a whole video into chunks.

Action Detection Boundary Detection +3

Channel Augmented Joint Learning for Visible-Infrared Recognition

3 code implementations ICCV 2021 Mang Ye, Weijian Ruan, Bo Du, Mike Zheng Shou

This paper introduces a powerful channel augmented joint learning strategy for the visible-infrared recognition problem.

Data Augmentation Diversity +1

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

3 code implementations CVPR 2021 Junting Pan, Siyu Chen, Mike Zheng Shou, Yu Liu, Jing Shao, Hongsheng Li

We propose to explicitly model the Actor-Context-Actor Relation, which is the relation between two actors based on their interactions with the context.

Action Detection Action Recognition +5

Cannot find the paper you are looking for? You can Submit a new open access paper.