When pre-training on the large-scale Kinetics-710, we achieve 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST.
For snippet-level learning, we introduce an online-updated memory to store reliable snippet prototypes for each class.
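An online-updated memory of per-class prototypes is typically maintained with a running average. The sketch below is an illustrative assumption, not the paper's exact scheme: it blends each new snippet feature into the stored class prototype with an exponential moving average (the momentum value and feature dimension are made up for the example).

```python
# Hedged sketch of an online-updated per-class prototype memory.
# The EMA momentum and dimensions are illustrative assumptions.

class PrototypeMemory:
    def __init__(self, num_classes, dim, momentum=0.9):
        self.momentum = momentum
        # one prototype vector per class, initialised lazily
        self.protos = [[0.0] * dim for _ in range(num_classes)]
        self.initialised = [False] * num_classes

    def update(self, label, feature):
        """Blend a new snippet feature into the stored class prototype."""
        if not self.initialised[label]:
            self.protos[label] = list(feature)
            self.initialised[label] = True
        else:
            m = self.momentum
            self.protos[label] = [
                m * p + (1.0 - m) * f
                for p, f in zip(self.protos[label], feature)
            ]

memory = PrototypeMemory(num_classes=2, dim=3)
memory.update(0, [1.0, 2.0, 3.0])  # first feature initialises the prototype
memory.update(0, [3.0, 2.0, 1.0])  # later features are blended in via EMA
```

Because only the running prototype is stored, the memory cost stays constant in the number of observed snippets.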
To address this issue, we introduce a new perspective to synthesize the signal-independent noise by a generative model.
Ranked #2 on Image Denoising on SID SonyA7S2 x300
Extensive experiments demonstrate that our model not only significantly improves existing methods on all these tasks, but also shows great ability in the few-shot and domain generalization settings.
Ranked #3 on Text based Person Retrieval on ICFG-PEDES
To address these issues, we develop a Motion-augmented Long-short Contrastive Learning (MoLo) method that contains two crucial components: a long-short contrastive objective and a motion autodecoder.
Learning from large-scale contrastive language-image pre-training like CLIP has shown remarkable success in a wide range of downstream tasks recently, but it is still under-explored on the challenging few-shot action recognition (FSAR) task.
Under this novel view, we propose a Class Center Similarity layer (CCS layer) to address the above-mentioned challenges by generating adaptive class centers conditioned on different scenes and supervising the similarities between class centers.
To be specific, HyRSM++ consists of two key components, a hybrid relation module and a temporal set matching metric.
Human-Object Interaction (HOI) detection aims to learn how humans interact with surrounding objects.
Person search aims at localizing and recognizing query persons from raw video frames, which is a combination of two sub-tasks, i.e., pedestrian detection and person re-identification.
Ranked #3 on Person Search on PRW
The former reduces the memory cost by preserving only one condensed frame instead of the whole video, while the latter compensates for the spatio-temporal details lost in the Frame Condensing stage.
Inspired by this, we propose Masked Action Recognition (MAR), which reduces redundant computation by discarding a proportion of patches and operating only on part of the videos.
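The patch-discarding idea can be sketched as randomly keeping a subset of patch tokens before the backbone, so later layers only process the kept subset. This is a minimal illustration under assumed shapes and a uniform-random masking strategy, not MAR's actual masking policy.

```python
# Hedged sketch of discarding a proportion of patch tokens.
# The mask ratio, seeding, and token shapes are illustrative assumptions.
import random

def mask_patches(tokens, mask_ratio=0.5, seed=0):
    """Keep a random subset of patch tokens; return kept tokens and their indices."""
    rng = random.Random(seed)
    num_keep = int(len(tokens) * (1.0 - mask_ratio))
    keep_idx = sorted(rng.sample(range(len(tokens)), num_keep))
    return [tokens[i] for i in keep_idx], keep_idx

tokens = [[float(i)] for i in range(8)]  # 8 dummy patch embeddings
kept, idx = mask_patches(tokens, mask_ratio=0.5)
# half of the tokens survive, so the backbone does roughly half the work
```

Since transformer cost grows with token count, dropping half the tokens roughly halves the attention computation.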
Ranked #9 on Action Recognition on Something-Something V2
This technical report presents our first-place solution for the temporal action detection task in the CVPR 2022 ActivityNet Challenge.
To overcome the two limitations, we propose a novel Hybrid Relation guided Set Matching (HyRSM) approach that incorporates two key components: a hybrid relation module and a set matching metric.
In this work, we aim to learn representations by leveraging more abundant information in untrimmed videos.
However, the transformer directly partitions crowd images into a series of tokens, which may not be a good choice because each pedestrian is an independent individual, and the network has a very large number of parameters.
First, we present a Domain-Specific Contrastive Learning (DSCL) mechanism to fully explore intra-domain information by comparing samples only from the same domain.
By combining two fundamental learning approaches in DML, i.e., classification training and pairwise training, we set up a strong baseline for ZS-SBIR.
In this work, we present a new method for 3D face reconstruction from sparse-view RGB images.
We introduce a Noise Disentanglement Module (NDM) to disentangle the noise and content in the reflectance maps with the reliable aid of unpaired clean images.
Ranked #1 on Low-Light Image Enhancement on MEF (NIQE metric)
Image manipulation with StyleGAN has been an increasing concern in recent years. Recent works have achieved tremendous success in analyzing several semantic latent spaces to edit the attributes of the generated images. However, due to the limited semantic and spatial manipulation precision in these latent spaces, existing approaches fall short of fine-grained StyleGAN image manipulation, i.e., local attribute translation. To address this issue, we discover attribute-specific control units, which consist of multiple channels of feature maps and modulation styles.
The last layer of an FCN is typically a global classifier (a 1x1 convolution) that assigns each pixel a semantic label.
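A 1x1 convolution acting as a global classifier is simply a shared linear classifier applied independently at every pixel. The sketch below illustrates this equivalence with made-up shapes and weights; it is not the paper's implementation.

```python
# Hedged sketch: a 1x1 convolution is a per-pixel linear classifier
# with weights shared across all spatial locations.

def conv1x1(feature_map, weight, bias):
    """feature_map: H x W x C_in, weight: C_out x C_in, bias: C_out."""
    out = []
    for row in feature_map:
        out_row = []
        for pixel in row:  # pixel is a length-C_in feature vector
            logits = [
                sum(w * x for w, x in zip(w_row, pixel)) + b
                for w_row, b in zip(weight, bias)
            ]
            out_row.append(logits)
        out.append(out_row)
    return out

# a 2x2 feature map with 3 channels, classified into 2 semantic labels
fmap = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
        [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]]
weight = [[1.0, 0.0, 0.0],   # label 0 responds to channel 0
          [0.0, 1.0, 1.0]]   # label 1 responds to channels 1 and 2
bias = [0.0, 0.0]
logits = conv1x1(fmap, weight, bias)  # H x W x C_out per-pixel class logits
```

Taking the argmax over the last axis at each pixel yields the per-pixel semantic label.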
Ranked #17 on Semantic Segmentation on PASCAL Context
Large-scale labeled training data is often difficult to collect, especially for person identities.
The visualizations show that ParamCrop adaptively controls the center distance and the IoU between two augmented views, and that the learned change in view disparity over the course of training is beneficial to learning a strong representation.
Temporal action localization aims to localize the start and end times of actions together with their categories.
Most recent approaches for online action detection tend to apply Recurrent Neural Network (RNN) to capture long-range temporal structure.
Ranked #7 on Online Action Detection on THUMOS'14
We obtain detection results by assigning each proposal its corresponding classification result.
Ranked #1 on Temporal Action Localization on ActivityNet-1.3 (using extra training data)
Then our proposed Local-Global Background Modeling Network (LGBM-Net) is trained to localize instances by using only video-level labels based on Multi-Instance Learning (MIL).
This technical report analyzes an egocentric video action detection method we used in the 2021 EPIC-KITCHENS-100 competition hosted at the CVPR 2021 workshop.
In this paper, we present empirical results for training a stronger video vision transformer on the EPIC-KITCHENS-100 Action Recognition dataset.
In this paper, we propose a Hybrid Attention Network (HAN) by employing Progressive Embedding Scale-context (PES) information, which enables the network to simultaneously suppress noise and adapt head scale variation.
We introduce a lightweight unit, conditional channel weighting, to replace costly pointwise (1x1) convolutions in shuffle blocks.
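Conditional channel weighting can be pictured as follows: instead of mixing channels with a C x C pointwise-convolution weight matrix (cost roughly H·W·C²), compute one scalar weight per channel from pooled statistics of the input and rescale (cost roughly H·W·C). The global-average pooling and sigmoid gate below are illustrative assumptions, not the unit's exact design.

```python
# Hedged sketch of conditional channel weighting as a cheap stand-in
# for a pointwise (1x1) convolution. Pooling and gating are assumptions.
import math

def channel_weighting(feature_map):
    """feature_map: C x H x W; returns the per-channel rescaled map."""
    weights = []
    for channel in feature_map:
        # global average pooling over the spatial dimensions
        pooled = sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        weights.append(1.0 / (1.0 + math.exp(-pooled)))  # sigmoid gate
    return [
        [[w * v for v in row] for row in channel]
        for w, channel in zip(weights, feature_map)
    ]

fmap = [[[0.0, 0.0], [0.0, 0.0]],   # pooled mean 0 -> weight 0.5
        [[2.0, 2.0], [2.0, 2.0]]]   # pooled mean 2 -> weight sigmoid(2)
out = channel_weighting(fmap)
```

The weights depend on the input (hence "conditional"), so the unit can still modulate channels adaptively while avoiding the quadratic-in-C cost of a 1x1 convolution.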
Ranked #37 on Pose Estimation on COCO test-dev
In this paper, we focus on applying the power of self-supervised methods to improve semi-supervised action proposal generation.
Ranked #2 on Semi-Supervised Action Detection on THUMOS'14
In this paper, we propose Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals through "local and global" temporal context aggregation and complementary as well as progressive boundary refinement.
Ranked #4 on Temporal Action Localization on ActivityNet-1.3
Specifically, to reconcile the conflicts of multiple objectives, we simplify the standard tightly coupled pipelines and establish a deeply decoupled multi-task learning framework.
Ranked #7 on Person Search on PRW
Specifically, to alleviate the intra-class variations, a clustering method is utilized to generate pseudo labels for both visual and textual instances.
This is different from the previous methods where all the joints are considered holistically and share the same feature.
In the conventional person Re-ID setting, it is widely assumed that cropped person images are available for each individual.
In this paper, we present a Representative Graph (RepGraph) layer to dynamically sample a few representative features, which dramatically reduces redundancy.
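The redundancy reduction can be sketched as attending over a small sampled subset of S positions instead of all N, shrinking the N x N affinity computation to N x S. Strided sampling and dot-product affinity below are illustrative assumptions, not the RepGraph layer's actual sampling strategy.

```python
# Hedged sketch: attention against a few sampled representative features.
# Sampling scheme and affinity function are illustrative assumptions.
import math

def repgraph_attention(features, num_samples):
    """features: N x D list; each output attends over only S sampled nodes."""
    step = max(1, len(features) // num_samples)
    reps = features[::step][:num_samples]  # sampled representative nodes
    out = []
    for q in features:
        scores = [sum(a * b for a, b in zip(q, r)) for r in reps]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]       # softmax over S, not N
        out.append([sum(w * r[d] for w, r in zip(attn, reps))
                    for d in range(len(q))])
    return out

feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
out = repgraph_attention(feats, num_samples=2)
```

With S fixed and small, the affinity cost scales linearly in N rather than quadratically.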
By this means, the proposed MLTPN can learn rich and discriminative features for different action instances with different durations.
Human pose estimation is the task of localizing body keypoints from still images.
This technical report analyzes a temporal action localization method we used in the HACS competition hosted in the ActivityNet Challenge 2020. The goal of our task is to locate the start time and end time of the action in the untrimmed video, and to predict the action category. First, we utilize video-level feature information to train multiple video-level action classification models.
In this report, we present our solution for the task of temporal action localization (detection) (task 1) in ActivityNet Challenge 2020.
The module builds a fully connected directed graph between regions of different density, where each node (region) is represented by a weighted, globally pooled feature, and a GCN is learned to map this region graph to a set of relation-aware region representations.
By training the image translation and dehazing networks in an end-to-end manner, we obtain better results for both image translation and dehazing.
Ranked #4 on Image Dehazing on RESIDE-6K
We propose to treat these spatial details and categorical semantics separately to achieve high accuracy and high efficiency for real-time semantic segmentation.
Ranked #1 on Real-Time Semantic Segmentation on COCO-Stuff
Given an input image and corresponding ground truth, Affinity Loss constructs an ideal affinity map to supervise the learning of Context Prior.
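An "ideal affinity map" built from ground truth can be sketched very simply: pixels i and j are assigned affinity 1 if and only if they share a semantic label. The flattened labels and binary form below are assumptions about the exact construction, used only to illustrate the supervision signal.

```python
# Hedged sketch of constructing an ideal affinity map from ground truth.
# Flattened labels and a binary 0/1 map are illustrative assumptions.

def ideal_affinity_map(labels):
    """labels: flattened ground-truth labels of length N; returns an N x N map."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]

gt = [0, 0, 1, 1]           # a tiny 2x2 ground-truth label map, flattened
A = ideal_affinity_map(gt)  # target that supervises the predicted Context Prior
```

The predicted affinity map is then pushed toward this target, encouraging same-class pixels to aggregate context from each other.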
Ranked #1 on Scene Understanding on ADE20K val
FFU and BFU add the IoU variance to the results of CFU, yielding class-specific foreground and background features, respectively.
The state-of-the-art methods train the detector individually, and the detected bounding boxes may be sub-optimal for the following re-ID task.
Semantic segmentation requires both rich spatial information and a sizeable receptive field.
Ranked #4 on Semantic Segmentation on SkyScapes-Dense
Most existing methods of semantic segmentation still suffer from two aspects of challenges: intra-class inconsistency and inter-class indistinction.
Ranked #5 on Semantic Segmentation on PASCAL VOC 2012 test
We present an effective blind image deblurring method based on a data-driven discriminative prior. Our work is motivated by the fact that a good image prior should favor clear images over blurred images. In this work, we formulate the image prior as a binary classifier which can be achieved by a deep convolutional neural network (CNN). The learned prior is able to distinguish whether an input image is clear or not. Embedded into the maximum a posteriori (MAP) framework, it helps blind deblurring in various scenarios, including natural, face, text, and low-illumination images. However, it is difficult to optimize the deblurring method with the learned image prior as it involves a non-linear CNN. Therefore, we develop an efficient numerical approach based on the half-quadratic splitting method and a gradient descent algorithm to solve the proposed model. Furthermore, the proposed model can be easily extended to non-uniform deblurring. Both qualitative and quantitative experimental results show that our method performs favorably against state-of-the-art algorithms as well as domain-specific image deblurring approaches.
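The half-quadratic splitting step can be sketched as follows. This is an illustrative decomposition under assumed notation (blurred input $B$, latent image $I$, kernel $k$, learned CNN prior $f$, weights $\gamma, \lambda, \beta$), not the paper's exact objective:

```latex
% Original MAP objective (symbols are assumptions for illustration):
\min_{I,k}\; \|I \otimes k - B\|_2^2 + \gamma \|k\|_2^2 + \lambda f(I)

% Introduce an auxiliary variable u \approx I to detach the non-linear CNN prior:
\min_{I,u,k}\; \|I \otimes k - B\|_2^2 + \gamma \|k\|_2^2
              + \beta \|I - u\|_2^2 + \lambda f(u)

% Alternating updates: the I- and k-subproblems are quadratic and admit
% closed-form solutions (e.g., via FFT), while the u-subproblem,
%   \min_u\; \beta \|I - u\|_2^2 + \lambda f(u),
% is handled by gradient descent through the CNN; \beta is typically
% increased across iterations so that u converges toward I.
```

The key point is that the splitting isolates the non-linear CNN term $f(u)$ in its own subproblem, leaving the remaining subproblems efficiently solvable in closed form.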
In this paper, the proposed framework takes a remarkably different direction to resolve the multi-scene detection problem in a bottom-up fashion.
Recently, scene text detection has become an active research topic in computer vision and document analysis, because of its great importance and significant challenge.
Ranked #6 on Scene Text Detection on COCO-Text
However, the task in tracking is to search for a specific object, rather than an object category as in detection.