Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred by a language expression in all video frames.
For content-based feature fusion, most existing matting methods focus only on local features, which lack the guidance of a global feature with strong semantic information related to the object of interest.
Then we propose a novel factor-wise discrimination objective in a contrastive learning manner, which can force the factorized representations to independently reflect the expressive information from different latent factors.
A typical pipeline for multi-object tracking (MOT) is to use a detector for object localization, followed by re-identification (re-ID) for object association.
Global distillation rebuilds the relation between different pixels and transfers it from teachers to students, compensating for missing global information in focal distillation.
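The relation-transfer idea above can be sketched as follows; the function name and the cosine-affinity choice are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def global_relation_loss(student_feat, teacher_feat):
    """Sketch of relation-based (global) distillation: build a pixel-pixel
    affinity matrix for student and teacher feature maps, then penalize
    their mismatch so the student inherits the teacher's global pixel
    relations. (Illustrative only, not the paper's exact loss.)"""
    def relation(f):
        f = f.reshape(-1, f.shape[-1])                          # (H*W, C)
        f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
        return f @ f.T                                          # cosine affinity (H*W, H*W)

    rs, rt = relation(student_feat), relation(teacher_feat)
    return np.mean((rs - rt) ** 2)                              # MSE over relation entries
```

A student whose per-pixel affinities already match the teacher's incurs zero loss, regardless of the absolute feature values.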
Vision-and-Language Navigation (VLN) is a task in which an agent must follow a language instruction to navigate to a goal position, relying on ongoing interactions with the environment as it moves.
Multi-object tracking (MOT) aims at estimating bounding boxes and identities of objects in videos.
A more realistic object detection paradigm, Open-World Object Detection, has recently attracted increasing research interest in the community.
Large-scale labeled training data is often difficult to collect, especially for person identities.
Video scene parsing is a long-standing challenging task in computer vision, aiming to assign pre-defined semantic labels to pixels of all frames in a given video.
In this paper, we propose a new loss based on center predictivity: a sample must be positioned in the feature space such that the location of the center of same-class samples can be roughly predicted from it.
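A minimal sketch of such a loss, assuming for illustration that the center "predictor" is the identity map, in which case the loss reduces to pulling each sample toward its class mean; the function name and mean-center choice are ours, not the paper's:

```python
import numpy as np

def center_predictivity_loss(features, labels):
    """Illustrative sketch: each sample's feature should (here, trivially)
    predict the mean of its class; penalize squared distance to that center."""
    loss = 0.0
    for c in np.unique(labels):
        cls = features[labels == c]          # samples belonging to class c
        center = cls.mean(axis=0)            # class center in feature space
        loss += np.sum((cls - center) ** 2)  # squared error of the "prediction"
    return loss / len(features)
```

With a learned predictor instead of the identity, the same structure applies: the penalty is the distance between the predicted and the actual class center.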
We extract degradation prior at task-level with the proposed ConditionNet, which will be used to adapt the parameters of the basic SR network (BaseNet).
Motivated by this question, we conduct a series of studies on the performance of self-supervised contrastive learning and supervised learning methods over multiple datasets where training instance distributions vary from a balanced one to a long-tailed one.
By disentangling representations on both image and instance levels, DIDN is able to learn domain-invariant representations that are suitable for generalized object detection.
A popular attempt to address this challenge is unpaired generative adversarial networks, which generate "real" LR counterparts of real HR images via image-to-image translation and then perform super-resolution on those "real" LR images ("real" LR->SR).
In this work, we propose TransTrack, a simple but efficient scheme to solve the multiple-object tracking problem.
We identify that the classification cost in the matching cost is the main ingredient: (1) previous detectors consider only location cost; (2) by additionally introducing a classification cost, previous detectors immediately produce one-to-one predictions during inference.
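The two-term matching cost described above could be sketched as follows; the weighting scheme, the L1 box distance, and all names are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def matching_cost(pred_boxes, pred_scores, gt_boxes, gt_labels,
                  w_cls=1.0, w_loc=1.0):
    """Sketch: combine an L1 location cost with a classification cost
    (negative predicted probability of the ground-truth class), since the
    text argues the classification term is the key ingredient."""
    n_pred, n_gt = len(pred_boxes), len(gt_boxes)
    cost = np.zeros((n_pred, n_gt))
    for i in range(n_pred):
        for j in range(n_gt):
            loc = np.abs(pred_boxes[i] - gt_boxes[j]).sum()  # L1 box distance
            cls = -pred_scores[i, gt_labels[j]]              # confident & correct -> lower cost
            cost[i, j] = w_loc * loc + w_cls * cls
    return cost
```

The resulting matrix would then be fed to a one-to-one assignment step (e.g. Hungarian matching) during training.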
In particular, for real-time generation tasks, different devices require generators of different sizes due to varying computing power.
In our method, however, a fixed sparse set of learned object proposals, with a total length of $N$, is provided to the object recognition head to perform classification and localization.
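A toy sketch of this design, with random stand-in weights (in practice the proposal boxes, proposal features, and head would all be learned end to end); the names and dimensions here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 100, 256                        # fixed number of proposals, feature dim
proposal_boxes = rng.random((N, 4))    # learnable (x, y, w, h) per proposal
proposal_feats = rng.random((N, D))    # learnable per-proposal feature vector

def recognition_head(feats, num_classes=80):
    """Illustrative head: maps each proposal feature to class logits and
    box refinements. Weights are random here; training would learn them."""
    W_cls = rng.random((D, num_classes))
    W_box = rng.random((D, 4))
    return feats @ W_cls, feats @ W_box  # (N, num_classes), (N, 4)

logits, box_deltas = recognition_head(proposal_feats)
```

The key property is that the set of proposals is fixed-size and sparse, so no dense anchor enumeration or non-maximum suppression over thousands of candidates is needed.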
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1 and reduce redundancy in representation.
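One common way to impose this property is a soft orthogonality penalty on each weight matrix; the sketch below shows the standard $\|W^\top W - I\|_F^2$ regularizer, which is zero exactly when all singular values of $W$ equal 1 (this is a generic formulation, not necessarily the paper's exact one):

```python
import numpy as np

def orthogonality_penalty(W):
    """Soft orthogonality regularizer: penalize the deviation of W^T W from
    the identity, pushing all singular values of W toward 1."""
    gram = W.T @ W
    eye = np.eye(W.shape[1])
    return np.sum((gram - eye) ** 2)   # squared Frobenius norm
```

In training, this term would be added to the task loss with a small weight for every layer being regularized.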
To perform action detection, we design a 3D convolution network with skip connections for tube classification and regression.
Attention mechanisms have been widely used in Visual Question Answering (VQA) solutions due to their capacity to model deep cross-domain interactions.
Leveraging both visual frames and audio has been experimentally proven effective for improving large-scale video classification.
We pose action localization as a structured prediction over arbitrary-length temporal windows, where each window is scored as the sum of frame-wise classification scores.
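Because each window's score is the sum of its frame-wise scores, the best arbitrary-length window can be found with a linear-time maximum-subarray scan; a sketch assuming per-frame scores for a single action class (function and variable names are ours):

```python
def best_window(frame_scores):
    """Sketch: score every temporal window as the sum of its frame-wise
    classification scores and return the highest-scoring one as
    (start, end, score). This is the maximum-subarray problem (Kadane's
    algorithm) when frame scores can be negative."""
    best = (0, 0, frame_scores[0])
    cur_start, cur_sum = 0, 0.0
    for t, s in enumerate(frame_scores):
        if cur_sum <= 0:
            cur_start, cur_sum = t, s   # restart the window at frame t
        else:
            cur_sum += s                # extend the current window
        if cur_sum > best[2]:
            best = (cur_start, t, cur_sum)
    return best
```

Negative frame scores (frames unlikely to contain the action) naturally truncate windows, so window length never has to be fixed in advance.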