Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

1 code implementation ICCV 2023 Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang

When pre-training on the large-scale Kinetics-710, we achieve 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST.

Transfer Learning Video Recognition

Towards Real-World Visual Tracking with Temporal Contexts

1 code implementation 20 Aug 2023 Ziang Cao, Ziyuan Huang, Liang Pan, Shiwei Zhang, Ziwei Liu, Changhong Fu

To handle those problems, we propose a two-level framework (TCTrack) that can exploit temporal contexts efficiently.

Visual Tracking

Rethinking Efficient Tuning Methods from a Unified Perspective

no code implementations 1 Mar 2023 Zeyinzi Jiang, Chaojie Mao, Ziyuan Huang, Yiliang Lv, Deli Zhao, Jingren Zhou

The U-Tuning framework can simultaneously encompass existing methods and derive new approaches for parameter-efficient transfer learning, which achieve on-par or better performance on CIFAR-100 and FGVC datasets compared with existing PETL methods.

Transfer Learning

Physically Plausible Animation of Human Upper Body from a Single Image

no code implementations 9 Dec 2022 Ziyuan Huang, Zhengping Zhou, Yung-Yu Chuang, Jiajun Wu, C. Karen Liu

We present a new method for generating controllable, dynamically responsive, and photorealistic human animations.

Progressive Learning without Forgetting

no code implementations 28 Nov 2022 Tao Feng, Hangjie Yuan, Mang Wang, Ziyuan Huang, Ang Bian, Jianzhou Zhang

Learning from changing tasks and sequential experience without forgetting the obtained knowledge is a challenging problem for artificial neural networks.

Continual Learning

RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection

3 code implementations 5 Sep 2022 Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni, Mingqian Tang

The task of Human-Object Interaction (HOI) detection targets fine-grained visual parsing of humans interacting with their environment, enabling a broad range of applications.

Human-Object Interaction Detection Relation +1

MAR: Masked Autoencoders for Efficient Action Recognition

1 code implementation 24 Jul 2022 Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Xiang Wang, Yuehuan Wang, Yiliang Lv, Changxin Gao, Nong Sang

Inspired by this, we propose Masked Action Recognition (MAR), which reduces the redundant computation by discarding a proportion of patches and operating only on a part of the videos.

Action Classification Action Recognition +1

TCTrack: Temporal Contexts for Aerial Tracking

1 code implementation CVPR 2022 Ziang Cao, Ziyuan Huang, Liang Pan, Shiwei Zhang, Ziwei Liu, Changhong Fu

Temporal contexts among consecutive frames are far from being fully utilized in existing visual trackers.

TAda! Temporally-Adaptive Convolutions for Video Understanding

2 code implementations ICLR 2022 Ziyuan Huang, Shiwei Zhang, Liang Pan, Zhiwu Qing, Mingqian Tang, Ziwei Liu, Marcelo H. Ang Jr

This work presents Temporally-Adaptive Convolutions (TAdaConv) for video understanding, which shows that adaptive weight calibration along the temporal dimension is an efficient way to facilitate modelling complex temporal dynamics in videos.

Ranked #62 on Action Recognition on Something-Something V2 (using extra training data)

Action Classification Action Recognition +2

ParamCrop: Parametric Cubic Cropping for Video Contrastive Learning

1 code implementation 24 Aug 2021 Zhiwu Qing, Ziyuan Huang, Shiwei Zhang, Mingqian Tang, Changxin Gao, Marcelo H. Ang Jr, Rong Jin, Nong Sang

The visualizations show that ParamCrop adaptively controls the center distance and the IoU between two augmented views, and the learned change in the disparity along the training process is beneficial to learning a strong representation.

Contrastive Learning

Support-Set Based Cross-Supervision for Video Grounding

no code implementations ICCV 2021 Xinpeng Ding, Nannan Wang, Shiwei Zhang, De Cheng, Xiaomeng Li, Ziyuan Huang, Mingqian Tang, Xinbo Gao

The contrastive objective aims to learn effective representations by contrastive learning, while the caption objective can train a powerful video encoder supervised by texts.

Contrastive Learning Video Grounding

Weakly-Supervised Temporal Action Localization Through Local-Global Background Modeling

no code implementations 20 Jun 2021 Xiang Wang, Zhiwu Qing, Ziyuan Huang, Yutong Feng, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Yuanjie Shao, Nong Sang

Then our proposed Local-Global Background Modeling Network (LGBM-Net) is trained to localize instances by using only video-level labels based on Multi-Instance Learning (MIL).

Weakly-supervised Learning Weakly-supervised Temporal Action Localization +1

Relation Modeling in Spatio-Temporal Action Localization

no code implementations 15 Jun 2021 Yutong Feng, Jianwen Jiang, Ziyuan Huang, Zhiwu Qing, Xiang Wang, Shiwei Zhang, Mingqian Tang, Yue Gao

This paper presents our solution to the AVA-Kinetics Crossover Challenge of the ActivityNet workshop at CVPR 2021.

Ranked #4 on Spatio-Temporal Action Localization on AVA-Kinetics (using extra training data)

Action Detection Relation +2

A Stronger Baseline for Ego-Centric Action Detection

1 code implementation 13 Jun 2021 Zhiwu Qing, Ziyuan Huang, Xiang Wang, Yutong Feng, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Changxin Gao, Marcelo H. Ang Jr, Nong Sang

This technical report analyzes an egocentric video action detection method we used in the 2021 EPIC-KITCHENS-100 competition hosted at the CVPR 2021 Workshop.

Action Detection

Multi-Scale Feature Aggregation by Cross-Scale Pixel-to-Region Relation Operation for Semantic Segmentation

no code implementations 3 Jun 2021 Yechao Bai, Ziyuan Huang, Lyuyu Shen, Hongliang Guo, Marcelo H. Ang Jr, Daniela Rus

Experimental results on two challenging datasets, Cityscapes and COCO, demonstrate that the RSP head performs competitively on both semantic segmentation and panoptic segmentation with high efficiency.

Panoptic Segmentation Relation +1

Self-supervised Motion Learning from Static Images

1 code implementation CVPR 2021 Ziyuan Huang, Shiwei Zhang, Jianwen Jiang, Mingqian Tang, Rong Jin, Marcelo Ang

We furthermore introduce a static mask in pseudo motions to create local motion patterns, which forces the model to additionally locate notable motion areas for the correct classification. We demonstrate that MoSI can discover regions with large motion even without fine-tuning on the downstream datasets.

Action Recognition Self-Supervised Learning

Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw

no code implementations 1 Jan 2021 Yuqi Huo, Mingyu Ding, Haoyu Lu, Zhiwu Lu, Tao Xiang, Ji-Rong Wen, Ziyuan Huang, Jianwen Jiang, Shiwei Zhang, Mingqian Tang, Songfang Huang, Ping Luo

With the constrained jigsaw puzzles, instead of solving them directly, which could still be extremely hard, we carefully design four surrogate tasks that are more solvable but meanwhile still ensure that the learned representation is sensitive to spatiotemporal continuity at both the local and global levels.

Representation Learning

Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes

no code implementations 16 Oct 2020 Li Yuan, Yichen Zhou, Shuning Chang, Ziyuan Huang, Yunpeng Chen, Xuecheng Nie, Tao Wang, Jiashi Feng, Shuicheng Yan

Prior works always fail to deal with this problem in two respects: (1) they do not utilize information about the scenes; (2) they lack training data for crowded and complex scenes.

Action Recognition In Videos Semantic Segmentation

A Simple Baseline for Pose Tracking in Videos of Crowded Scenes

no code implementations 16 Oct 2020 Li Yuan, Shuning Chang, Ziyuan Huang, Yichen Zhou, Yunpeng Chen, Xuecheng Nie, Francis E. H. Tay, Jiashi Feng, Shuicheng Yan

This paper presents our solution to the ACM MM challenge: Large-scale Human-centric Video Analysis in Complex Events; specifically, here we focus on Track 3: Crowd Pose Tracking in Complex Events.

Multi-Object Tracking Optical Flow Estimation +1

Towards Accurate Human Pose Estimation in Videos of Crowded Scenes

no code implementations 16 Oct 2020 Li Yuan, Shuning Chang, Xuecheng Nie, Ziyuan Huang, Yichen Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan

In this paper, we focus on improving human pose estimation in videos of crowded scenes from the perspectives of exploiting temporal context and collecting new data.

Optical Flow Estimation Pose Estimation

Toward Hierarchical Self-Supervised Monocular Absolute Depth Estimation for Autonomous Driving Applications

1 code implementation 12 Apr 2020 Feng Xue, Guirong Zhuo, Ziyuan Huang, Wufei Fu, Zhuoyue Wu, Marcelo H. Ang Jr

Our contributions are twofold: a) a novel dense connected prediction (DCP) layer is proposed to provide better object-level depth estimation and b) specifically for autonomous driving scenarios, dense geometrical constraints (DGC) are introduced so that a precise scale factor can be recovered without additional cost for autonomous vehicles.

Autonomous Driving Monocular Depth Estimation +1

Keyfilter-Aware Real-Time UAV Object Tracking

1 code implementation 11 Mar 2020 Yiming Li, Changhong Fu, Ziyuan Huang, Yinqiang Zhang, Jia Pan

Correlation filter-based tracking has been widely applied to unmanned aerial vehicles (UAVs) with high efficiency.

Object Object Tracking +2

Augmented Memory for Correlation Filters in Real-Time UAV Tracking

1 code implementation 24 Sep 2019 Yiming Li, Changhong Fu, Fangqiang Ding, Ziyuan Huang, Jia Pan

The outstanding computational efficiency of the discriminative correlation filter (DCF) fades away with various complicated improvements.

Computational Efficiency

Learning Aberrance Repressed Correlation Filters for Real-Time UAV Tracking

1 code implementation ICCV 2019 Ziyuan Huang, Changhong Fu, Yiming Li, Fuling Lin, Peng Lu

The traditional framework of discriminative correlation filters (DCF) is often subject to undesired boundary effects.

Object Tracking

