Recent studies have shown that aggregating contextual information from proposals in different frames can clearly enhance the performance of video object detection.
Ranked #1 on Video Object Detection on ImageNet VID
It can adaptively enhance the source detector to perceive objects in a target image by leveraging target proposal contexts from iterative cross-attention.
Specifically, TRKP adopts the teacher-student framework, where a multi-head teacher network is built to extract knowledge from labeled source domains and guide the student network to learn detectors for the unlabeled target domain.
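As a rough illustration of the iterative cross-attention described above, here is a minimal PyTorch sketch (module name, iteration count, and head count are our assumptions, not the authors' code) in which source proposal features repeatedly attend to target proposal contexts:

```python
import torch
import torch.nn as nn

class IterativeCrossAttention(nn.Module):
    """Hypothetical sketch: source detector features are refined over
    several iterations by attending to target proposal contexts."""
    def __init__(self, dim: int, num_heads: int = 8, num_iters: int = 2):
        super().__init__()
        self.num_iters = num_iters
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, src_feats, tgt_proposals):
        # src_feats: (B, N_src, C) source proposal features
        # tgt_proposals: (B, N_tgt, C) target proposal context features
        x = src_feats
        for _ in range(self.num_iters):
            # queries from the source, keys/values from the target
            ctx, _ = self.attn(x, tgt_proposals, tgt_proposals)
            x = self.norm(x + ctx)  # residual refinement per iteration
        return x
```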
Learning spatial-temporal relations among multiple actors is crucial for group activity recognition.
Different from typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local and global token affinity in shallow and deep layers respectively, allowing it to tackle both redundancy and dependency for efficient and effective representation learning.
Ranked #72 on Image Classification on ImageNet
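To make the shallow-versus-deep design concrete, here is a minimal PyTorch sketch of the two kinds of relation aggregators (a hypothetical rendering with illustrative layer sizes): local token affinity realized as a depthwise convolution over the token grid, and global token affinity realized as full self-attention.

```python
import torch
import torch.nn as nn

class LocalAggregator(nn.Module):
    """Shallow layers: local token affinity via depthwise convolution,
    which avoids redundant all-pairs comparison between nearby tokens."""
    def __init__(self, dim: int, kernel_size: int = 5):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):  # x: (B, C, H, W) token grid
        return self.dwconv(x)

class GlobalAggregator(nn.Module):
    """Deep layers: global token affinity via self-attention,
    which captures long-range dependency across all tokens."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W) token grid
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)  # all-pairs affinity
        return out.transpose(1, 2).reshape(B, C, H, W)
```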
To fill this gap, we propose a generic Contour-Perturbed Reconstruction Network (CP-Net), which can effectively guide self-supervised reconstruction to learn semantic content in the point cloud and thus promote the discriminative power of point cloud representations.
For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performance of 60.9% and 71.2% top-1 accuracy respectively.
2 code implementations • 22 Dec 2021 • Liang Pan, Tong Wu, Zhongang Cai, Ziwei Liu, Xumin Yu, Yongming Rao, Jiwen Lu, Jie Zhou, Mingye Xu, Xiaoyuan Luo, Kexue Fu, Peng Gao, Manning Wang, Yali Wang, Yu Qiao, Junsheng Zhou, Xin Wen, Peng Xiang, Yu-Shen Liu, Zhizhong Han, Yuanjie Yan, Junyi An, Lifa Zhu, Changwei Lin, Dongrui Liu, Xin Li, Francisco Gómez-Fernández, Qinlong Wang, Yang Yang
Based on the MVP dataset, this paper reports methods and results in the Multi-View Partial Point Cloud Challenge 2021 on Completion and Registration.
Self-attention has become an integral component of recent network architectures, e.g., the Transformer, that dominate major image and video benchmarks.
Ranked #11 on Action Recognition on Something-Something V2 (using extra training data)
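For reference, the scaled dot-product self-attention at the core of such architectures fits in a few lines; this is a generic sketch, not any specific paper's implementation:

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a token sequence.
    x: (B, N, C) tokens; w_q, w_k, w_v: (C, C) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)  # (B, N, N)
    return attn @ v  # weighted sum of values
```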
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs by dynamic token aggregation.
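A minimal sketch of dynamic token aggregation in the spirit of the TSM (the weighting scheme below is our assumption, not necessarily the paper's exact design): N input tokens are softly assigned to M < N output tokens by learned, normalized weights.

```python
import torch
import torch.nn as nn

class TokenSlimming(nn.Module):
    """Hypothetical token slimming: each of the M output tokens is a
    convex combination of the N input tokens, so later blocks run on
    fewer tokens and inference gets cheaper."""
    def __init__(self, dim: int, num_out_tokens: int):
        super().__init__()
        self.score = nn.Linear(dim, num_out_tokens)  # per-token logits

    def forward(self, x):  # x: (B, N, C)
        w = torch.softmax(self.score(x), dim=1)  # (B, N, M), sums to 1 over N
        return w.transpose(1, 2) @ x             # (B, M, C) slimmed tokens
```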
For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performance of 60.8% and 71.4% top-1 accuracy respectively.
Ranked #1 on Action Recognition on Something-Something V1
Graph Convolution Network (GCN) has been successfully used for 3D human pose estimation in videos.
Specifically, the limitations can be categorized into two types: ambiguous supervision in the foreground and invalid supervision in the background.
In this work, an object detector based on YOLOF is proposed to detect blood cell objects such as red blood cells, white blood cells, and platelets.
It can learn to exploit spatial, temporal, and channel attention in a high-dimensional manner, improving the cooperation of all the feature dimensions in our CT-Module.
Ranked #6 on Action Recognition on Something-Something V1
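A toy sketch of jointly gating the spatial, temporal, and channel dimensions of a video feature map (the scheme below is illustrative only, not the paper's CT-Module):

```python
import torch
import torch.nn as nn

class STCAttention(nn.Module):
    """Illustrative spatial-temporal-channel gating for (B, C, T, H, W)
    video features: one sigmoid gate per dimension, applied jointly."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel_fc = nn.Linear(channels, channels)

    def forward(self, x):  # x: (B, C, T, H, W)
        c = torch.sigmoid(self.channel_fc(x.mean(dim=(2, 3, 4))))  # (B, C)
        t = torch.sigmoid(x.mean(dim=(1, 3, 4)))                   # (B, T)
        s = torch.sigmoid(x.mean(dim=(1, 2)))                      # (B, H, W)
        x = x * c[:, :, None, None, None]   # channel gate
        x = x * t[:, None, :, None, None]   # temporal gate
        x = x * s[:, None, None, :, :]      # spatial gate
        return x
```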
Second, the coarse action classes often lead to the ambiguous annotations of temporal boundaries, which are inappropriate for temporal action localization.
By coupling advanced 3D pose estimators and HMR in a serial or parallel manner, these two frameworks can effectively correct the human mesh with the guidance of a concise pose calibration module.
A preference-based multi-objective evolutionary algorithm is proposed for generating solutions in an automatically detected knee point region.
Our SmallBig network outperforms a number of recent state-of-the-art approaches, in terms of accuracy and/or efficiency.
Given a point in $m$-dimensional objective space, any $\varepsilon$-ball around it can be partitioned into the incomparable, the dominated, and the dominating regions.
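Under the standard Pareto-dominance order for minimization, this partition can be written out explicitly; the formalization below is our reading of the claim:

\[
B_\varepsilon(y)\setminus\{y\} \;=\; D^{+} \,\dot\cup\, D^{-} \,\dot\cup\, I,
\]
where
\[
D^{+} = \{\, z \in B_\varepsilon(y) : z \prec y \,\}, \quad
D^{-} = \{\, z \in B_\varepsilon(y) : y \prec z \,\}, \quad
I = \{\, z \in B_\varepsilon(y)\setminus\{y\} : z \nprec y,\ y \nprec z \,\},
\]
and $z \prec y$ means $z_i \le y_i$ for all $i \in \{1,\dots,m\}$ with $z_j < y_j$ for at least one $j$.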
A customized multi-objective evolutionary algorithm (MOEA) is proposed for the multi-objective flexible job shop scheduling problem (FJSP).
Few-shot object detection is a challenging but realistic scenario, where only a few annotated training images are available for training detectors.
These distinct gate vectors inherit mutual context on semantic differences, which allows API-Net to attentively capture contrastive clues by pairwise interaction between the two images.
Ranked #3 on Fine-Grained Image Classification on Stanford Dogs
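A simplified sketch of such gated pairwise interaction (the exact API-Net formulation may differ; the module below is an assumption for illustration): a mutual vector built from both images gates each feature to highlight contrastive clues.

```python
import torch
import torch.nn as nn

class PairwiseInteraction(nn.Module):
    """Hypothetical API-Net-style interaction: a mutual context vector
    produces per-image gates that emphasize discriminative differences."""
    def __init__(self, dim: int):
        super().__init__()
        self.mutual = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))

    def forward(self, x1, x2):  # x1, x2: (B, C) embeddings of an image pair
        xm = self.mutual(torch.cat([x1, x2], dim=-1))  # mutual context
        g1 = torch.sigmoid(xm * x1)  # gate derived for image 1
        g2 = torch.sigmoid(xm * x2)  # gate derived for image 2
        # each feature is enhanced by its own gate and its partner's gate
        return x1 + x1 * g1 + x1 * g2, x2 + x2 * g2 + x2 * g1
```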
no code implementations • 9 Feb 2019 • Chu Qin, Ying Tan, Shang Ying Chen, Xian Zeng, Xingxing Qi, Tian Jin, Huan Shi, Yiwei Wan, Yu Chen, Jingfeng Li, Weidong He, Yali Wang, Peng Zhang, Feng Zhu, Hongping Zhao, Yuyang Jiang, Yuzong Chen
We explored the superior learning capability of deep autoencoders for unsupervised clustering of 1.39 million bioactive molecules into band-clusters in a 3-dimensional latent chemical space.
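As a toy illustration of this setup, a deep autoencoder with a 3-dimensional bottleneck could look as follows (input dimension and layer widths are placeholders, not the paper's architecture); clustering is then performed on the latent coordinates z.

```python
import torch
import torch.nn as nn

class MoleculeAutoencoder(nn.Module):
    """Illustrative deep autoencoder: molecular fingerprints are compressed
    into a 3-D latent chemical space, where band-clusters can be found."""
    def __init__(self, in_dim: int = 1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 3),  # 3-dimensional latent chemical space
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z  # reconstruction and latent coordinates
```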
To mimic this capacity, we propose a novel Hybrid Video Memory (HVM) machine, which can hallucinate temporal features of still images from video memory, in order to boost action recognition with few still images.
Second, we introduce a novel regularized transfer learning framework for low-shot detection, where transfer knowledge (TK) and background depression (BD) regularizations are proposed to leverage object knowledge from the source and target domains respectively, in order to further enhance fine-tuning with a few target images.
Ranked #14 on Few-Shot Object Detection on MS-COCO (30-shot)
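The background depression idea can be sketched as a simple feature-level penalty (a hypothetical rendering; the paper's exact BD regularization may differ): feature activations outside annotated object regions are suppressed so that fine-tuning on a few target images does not latch onto background clutter.

```python
import torch

def background_depression_loss(feat, fg_mask):
    """Sketch of a BD-style regularizer (details assumed, not the paper's):
    feat: (B, C, H, W) backbone features; fg_mask: (B, 1, H, W) binary mask
    marking ground-truth object regions."""
    bg = feat * (1.0 - fg_mask)  # keep only background responses
    return bg.pow(2).mean()      # depress their magnitude (L2 penalty)
```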
Firstly, unlike previous works on pose-related action recognition, our RPAN is an end-to-end recurrent network which can exploit important spatial-temporal evolutions of human pose to assist action recognition in a unified framework.
Ranked #5 on Skeleton Based Action Recognition on J-HMDB
In this paper, we propose a hybrid representation, which leverages the discriminative capacity of CNNs and the simplicity of descriptor encoding schemes for image recognition, with a focus on scene recognition.