Search Results for author: Xitong Yang

Found 23 papers, 8 papers with code

Deep Multimodal Representation Learning from Temporal Data

no code implementations · CVPR 2017 · Xitong Yang, Palghat Ramesh, Radha Chitta, Sriganesh Madhvanath, Edgar A. Bernal, Jiebo Luo

In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications.

Audio-Visual Speech Recognition · Representation Learning · +4

An Interactive Greedy Approach to Group Sparsity in High Dimensions

1 code implementation · 10 Jul 2017 · Wei Qian, Wending Li, Yasuhiro Sogawa, Ryohei Fujimaki, Xitong Yang, Ji Liu

Sparsity learning with known grouping structure has received considerable attention due to wide modern applications in high-dimensional data analysis.

Human Activity Recognition · Vocal Bursts Intensity Prediction

The Effectiveness of Instance Normalization: a Strong Baseline for Single Image Dehazing

no code implementations · 8 May 2018 · Zheng Xu, Xitong Yang, Xue Li, Xiaoshuai Sun

We propose a novel deep neural network architecture for the challenging problem of single image dehazing, which aims to recover the clear image from a degraded hazy image.

Image Dehazing · Single Image Dehazing
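The title credits instance normalization as the key ingredient of this baseline. As a rough illustration of that ingredient only, here is a minimal residual block built around `nn.InstanceNorm2d`; this is a generic sketch, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class INResidualBlock(nn.Module):
    """Hypothetical conv block using instance normalization, in the spirit of
    an InstanceNorm baseline; not the paper's exact design."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels, affine=True),  # normalize per image, per channel
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels, affine=True),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection preserves image content

hazy_features = torch.randn(2, 64, 32, 32)  # (B, C, H, W)
print(INResidualBlock()(hazy_features).shape)  # torch.Size([2, 64, 32, 32])
```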

STEP: Spatio-Temporal Progressive Learning for Video Action Detection

1 code implementation · CVPR 2019 · Xitong Yang, Xiaodong Yang, Ming-Yu Liu, Fanyi Xiao, Larry Davis, Jan Kautz

In this paper, we propose the Spatio-TEmporal Progressive (STEP) action detector, a progressive learning framework for spatio-temporal action detection in videos.

Action Detection · Action Recognition

Cross-X Learning for Fine-Grained Visual Categorization

no code implementations · ICCV 2019 · Wei Luo, Xitong Yang, Xianjie Mo, Yuheng Lu, Larry S. Davis, Jun Li, Jian Yang, Ser-Nam Lim

Recognizing objects from subcategories with very subtle differences remains a challenging task due to the large intra-class and small inter-class variation.

Ranked #18 on Fine-Grained Image Classification on NABirds (using extra training data)

Fine-Grained Image Classification · Fine-Grained Visual Categorization

Hierarchical Contrastive Motion Learning for Video Action Recognition

no code implementations · 20 Jul 2020 · Xitong Yang, Xiaodong Yang, Sifei Liu, Deqing Sun, Larry Davis, Jan Kautz

Thus, the motion features at higher levels are trained to gradually capture semantic dynamics and become more discriminative for action recognition.

Action Recognition · Contrastive Learning · +2
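The snippet above refers to training motion features contrastively across feature levels. For reference, a generic InfoNCE contrastive loss looks like the following; this is the standard formulation, not the paper's exact hierarchical objective.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE: pull `query` toward `positive`, push away `negatives`.
    Shapes: query/positive (B, D); negatives (B, K, D). Standard formulation,
    not the paper's exact hierarchical objective."""
    q = F.normalize(query, dim=-1)
    pos = F.normalize(positive, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    l_pos = (q * pos).sum(-1, keepdim=True)              # (B, 1) positive similarity
    l_neg = torch.einsum("bd,bkd->bk", q, neg)           # (B, K) negative similarities
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)    # positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 16, 128))
```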

GTA: Global Temporal Attention for Video Action Understanding

no code implementations · 15 Dec 2020 · Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, Abhinav Shrivastava

To this end, we introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.

Action Recognition · Action Understanding · +1
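The abstract describes applying global temporal attention on top of spatial attention in a decoupled manner. A schematic version of such decoupling is sketched below; it is a deliberate simplification for illustration, not the released GTA module.

```python
import torch
import torch.nn as nn

class DecoupledSpaceTimeAttention(nn.Module):
    """Schematic decoupled attention: attend within each frame (spatial),
    then across all frames at each spatial location (global temporal).
    A simplification, not the paper's GTA module."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                  # x: (B, T, N, D) tokens per frame
        B, T, N, D = x.shape
        s = x.reshape(B * T, N, D)
        s, _ = self.spatial(s, s, s)       # spatial attention within each frame
        t = s.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t, _ = self.temporal(t, t, t)      # global attention across all T frames
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)

x = torch.randn(2, 8, 49, 64)              # 8 frames, 7x7 tokens each
print(DecoupledSpaceTimeAttention()(x).shape)  # torch.Size([2, 8, 49, 64])
```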

Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories

no code implementations · CVPR 2021 · Xitong Yang, Haoqi Fan, Lorenzo Torresani, Larry Davis, Heng Wang

The standard way of training video models entails sampling at each iteration a single clip from a video and optimizing the clip prediction with respect to the video-level label.

Action Detection · Action Recognition · +1
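The abstract summarizes the standard clip-based training baseline. A bare-bones version of that baseline is sketched below for context; the paper's actual contribution, sharing information across clips through a collaborative memory, is not reproduced here.

```python
import torch

def sample_clip(video, clip_len=16):
    """Randomly sample one contiguous clip from a video tensor (T, C, H, W)."""
    start = torch.randint(0, video.size(0) - clip_len + 1, (1,)).item()
    return video[start:start + clip_len]

def train_step(model, optimizer, video, label, criterion):
    """The standard baseline the abstract describes: one clip per video per
    iteration, supervised by the video-level label. Sketch only."""
    clip = sample_clip(video).unsqueeze(0)           # (1, clip_len, C, H, W)
    loss = criterion(model(clip), label.unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```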

Semi-Supervised Vision Transformers

1 code implementation · 22 Nov 2021 · Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang

Surprisingly, we show Vision Transformers perform significantly worse than Convolutional Neural Networks when only a small set of labeled data is available.

Inductive Bias · Semi-Supervised Image Classification

Efficient Video Transformers with Spatial-Temporal Token Selection

1 code implementation · 23 Nov 2021 · Junke Wang, Xitong Yang, Hengduo Li, Li Liu, Zuxuan Wu, Yu-Gang Jiang

Video transformers have achieved impressive results on major video recognition benchmarks; however, they suffer from high computational cost.

Video Recognition
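The title refers to selecting informative spatial-temporal tokens to reduce that cost. A generic top-k token pruning layer conveys the idea; it is illustrative only, not the paper's selection network.

```python
import torch
import torch.nn as nn

class TopKTokenSelector(nn.Module):
    """Keep only the k highest-scoring tokens to cut attention cost.
    A generic pruning sketch, not the paper's selection module."""
    def __init__(self, dim=64, k=32):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # lightweight per-token importance score
        self.k = k

    def forward(self, tokens):            # tokens: (B, N, D)
        scores = self.scorer(tokens).squeeze(-1)           # (B, N)
        idx = scores.topk(self.k, dim=1).indices           # (B, k)
        batch = torch.arange(tokens.size(0)).unsqueeze(1)  # (B, 1) for indexing
        return tokens[batch, idx]                          # (B, k, D)

x = torch.randn(2, 196, 64)
print(TopKTokenSelector()(x).shape)       # torch.Size([2, 32, 64])
```

Note that a hard top-k is not differentiable with respect to the scores, so methods of this kind typically train the selector with a differentiable relaxation such as Gumbel-Softmax.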

ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization

1 code implementation · CVPR 2022 · Bo He, Xitong Yang, Le Kang, Zhiyu Cheng, Xin Zhou, Abhinav Shrivastava

Without the boundary information of action segments, existing methods mostly rely on multiple instance learning (MIL), where the predictions of unlabeled instances (i.e., video snippets) are supervised by classifying labeled bags (i.e., untrimmed videos).

Weakly Supervised Temporal Action Localization
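The abstract describes MIL supervision, where snippet-level predictions are aggregated and supervised only by the video-level label. A minimal top-k pooling version of that idea follows; top-k mean pooling is a common choice in this literature, not necessarily ASM-Loc's exact loss.

```python
import torch
import torch.nn.functional as F

def mil_video_loss(snippet_logits, video_label, k=8):
    """Multiple instance learning: pool per-snippet class scores into a
    video-level prediction and supervise with the video label only.
    snippet_logits: (T, num_classes); video_label: (num_classes,) multi-hot.
    Top-k mean pooling is one common choice, not necessarily ASM-Loc's loss."""
    k = min(k, snippet_logits.size(0))
    video_logits = snippet_logits.topk(k, dim=0).values.mean(dim=0)  # (num_classes,)
    return F.binary_cross_entropy_with_logits(video_logits, video_label)

label = torch.zeros(20).scatter_(0, torch.tensor([3]), 1.0)  # class 3 present
loss = mil_video_loss(torch.randn(100, 20), label)
```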

Vision Transformers Are Good Mask Auto-Labelers

no code implementations · CVPR 2023 · Shiyi Lan, Xitong Yang, Zhiding Yu, Zuxuan Wu, Jose M. Alvarez, Anima Anandkumar

We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations.

Instance Segmentation · Segmentation · +1

Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization

1 code implementation · 1 Feb 2023 · Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, Yu-Gang Jiang

Our framework extends CLIP with minimal modifications to model spatial-temporal relationships in videos, making it a specialized video classifier, while striving for generalization.

Action Recognition · Continual Learning · +2
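The "interpolated weight optimization" of the title operates in weight space. A generic linear interpolation between the pretrained CLIP weights and a fine-tuned video model conveys the core operation; the coefficient and its use below are assumptions, not the paper's settings.

```python
import torch

def interpolate_weights(pretrained_state, finetuned_state, alpha=0.5):
    """Linearly interpolate two state dicts of the same architecture:
    theta = (1 - alpha) * theta_pretrained + alpha * theta_finetuned.
    Generic weight-space interpolation; the paper optimizes with such
    interpolation to retain zero-shot generality (alpha=0.5 is an assumption)."""
    return {
        name: (1 - alpha) * pretrained_state[name] + alpha * finetuned_state[name]
        for name in pretrained_state
    }

# Usage sketch: model.load_state_dict(interpolate_weights(clip_sd, video_sd, 0.5))
```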

MINOTAUR: Multi-task Video Grounding From Multimodal Queries

no code implementations · 16 Feb 2023 · Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar, Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran

Video understanding tasks take many forms, from action detection to visual query localization and spatio-temporal grounding of sentences.

Action Detection · Sentence · +2

Towards Scalable Neural Representation for Diverse Videos

no code implementations · CVPR 2023 · Bo He, Xitong Yang, Hanyu Wang, Zuxuan Wu, Hao Chen, Shuaiyi Huang, Yixuan Ren, Ser-Nam Lim, Abhinav Shrivastava

Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images, and have been recently applied to encode videos (e.g., NeRV, E-NeRV).

Action Recognition · Video Compression
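As context for the INR framing, a toy implicit video representation maps a normalized frame index to a full frame, so the network weights themselves encode the video. This is NeRV-like in spirit only; the real NeRV uses positional encodings and a convolutional decoder.

```python
import torch
import torch.nn as nn

class TinyVideoINR(nn.Module):
    """Toy implicit video representation: map frame index t in [0, 1] to a frame,
    so the weights *are* the encoded video. Illustration only; NeRV itself uses
    positional encoding and a conv decoder."""
    def __init__(self, h=16, w=16, c=3, hidden=256):
        super().__init__()
        self.h, self.w, self.c = h, w, c
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, h * w * c), nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, t):                        # t: (B, 1) normalized frame indices
        return self.net(t).view(-1, self.c, self.h, self.w)

frames = TinyVideoINR()(torch.tensor([[0.0], [0.5], [1.0]]))
print(frames.shape)                              # torch.Size([3, 3, 16, 16])
```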

Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data

1 code implementation · 8 Oct 2023 · Zuxuan Wu, Zejia Weng, Wujian Peng, Xitong Yang, Ang Li, Larry S. Davis, Yu-Gang Jiang

Despite significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition.

Action Recognition · Continual Learning · +5

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

no code implementations · 30 Nov 2023 · Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.

Video Understanding

Video ReCap: Recursive Captioning of Hour-Long Videos

no code implementations · 20 Feb 2024 · Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius

We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos.

Video Captioning · Video Understanding
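The abstract describes a hierarchy from clip-level captions to segment-level descriptions to an hour-long summary. The recursion can be sketched as follows, where `caption_model` is a hypothetical text summarizer, not the released Video ReCap interface.

```python
def recursive_caption(clip_captions, caption_model, group_size=10):
    """Schematic recursion: summarize groups of lower-level captions into
    higher-level descriptions until a single summary remains. `caption_model`
    is a hypothetical callable summarizer, not the released Video ReCap API."""
    level = clip_captions
    while len(level) > 1:
        level = [
            caption_model(" ".join(level[i:i + group_size]))
            for i in range(0, len(level), group_size)
        ]
    return level[0]  # single summary for the full hour-long video

# e.g., recursive_caption(["person opens fridge", "person pours milk"], my_summarizer)
```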
