Understanding the semantics of individual regions or patches within unconstrained images, such as in open-world object detection, represents a critical yet challenging task in computer vision.
CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept.
Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed.
Open-world instance segmentation is an emerging task that aims to segment all objects in an image by learning from a limited number of base-category objects.
We show that only language-paired two-modality data is sufficient to connect all modalities.
Then, following a novel meta-optimization scheme that optimizes the model for good testing performance on virtual testing sets after training on virtual training sets, our framework effectively drives the model to better capture the semantics and visual representations of individual concepts, thereby achieving robust generalization even when handling novel compositions.
Pre-training VTs on such corrupted data can be challenging, especially with the masked autoencoding approach, where both the inputs and the masked "ground truth" targets can be unreliable.
Extensive experimental results show that, without further fine-tuning, ECLIP surpasses existing methods by a large margin on a broad range of downstream tasks, demonstrating its strong transferability to real-world E-commerce applications.
We extend our pretext task to supervised pre-training, which achieves a similar performance to self-supervised learning.
Learning image classification and image generation using the same set of network parameters is a challenging problem.
All instance perception tasks aim at finding certain objects specified by some queries such as category names, language expressions, and target annotations, yet this field has long been split into multiple independent subtasks.
This is the first use of sparse convolution for 2D masked modeling.
Ranked #1 on Instance Segmentation on COCO 2017 val
In this work, we end the current fragmented situation and propose UniRef to unify the three reference-based object segmentation tasks with a single architecture.
The existing end-to-end methods rely on dense representations to preserve the spatial detail and structure for precise keypoint localization.
In this paper, we propose a novel open-vocabulary object detection framework that learns directly from image-text pair data.
Person search aims at localizing and recognizing query persons from raw video frames; it combines two sub-tasks, i.e., pedestrian detection and person re-identification.
Ranked #3 on Person Search on PRW
As a result, our model can spontaneously and effectively extract both static appearance and dynamic motion, leading to superior spatiotemporal representation learning capability.
Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.
Spatio-Temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression.
Our key finding is that the major cause of degradation is not information loss in the down-sampling process, but rather the mismatch between network architecture and input scale.
To this end, we propose MCIBI++, a novel paradigm that softly mines contextual information beyond the image to further boost pixel-level representations.
Based on the single-stage instance segmentation framework, we propose a regularization model to predict foreground pixels and use its relation to instance segmentation to construct a cross-task consistency loss.
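A minimal sketch of such a cross-task consistency term, under the assumption (not stated in the snippet) that the union of per-instance masks should agree with the separately predicted foreground map; function and variable names are illustrative:

```python
import numpy as np

def cross_task_consistency_loss(fg_prob, instance_masks):
    """Penalize disagreement between a predicted foreground probability map
    of shape (H, W) and the union of K per-instance masks of shape (K, H, W)."""
    union = np.clip(instance_masks.sum(axis=0), 0.0, 1.0)  # soft union of instances
    return float(((fg_prob - union) ** 2).mean())
```

When the instance masks tile exactly the predicted foreground, the loss is zero; any pixel claimed by one task but not the other contributes quadratically.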
We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters.
Current distillation algorithms usually improve students' performance by imitating the output of the teacher.
In two typical cross-domain semantic segmentation tasks, i.e., GTA5 to Cityscapes and SYNTHIA to Cityscapes, our method achieves state-of-the-art segmentation accuracy.
Fine-Grained Visual Classification (FGVC) is the task of recognizing objects belonging to multiple subordinate categories of a super-category.
Ranked #1 on Fine-Grained Image Classification on NABirds (using extra training data)
Comparing distribution differences between HQ and LQ images helps our model better assess image quality.
Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred to by a language expression in all video frames.
Ranked #3 on Referring Video Object Segmentation on MeViS
Then we propose a novel factor-wise discrimination objective in a contrastive learning manner, which can force the factorized representations to independently reflect the expressive information from different latent factors.
For emerging content-based feature fusion, most existing matting methods focus only on local features and lack the guidance of a global feature carrying strong semantic information about the object of interest.
Ranked #3 on Image Matting on Composition-1K
A typical pipeline for multi-object tracking (MOT) is to use a detector for object localization, followed by re-identification (re-ID) for object association.
Global distillation rebuilds the relation between different pixels and transfers it from teachers to students, compensating for missing global information in focal distillation.
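One way to sketch such a global term (the paper's exact relation module may differ; the cosine-Gram form below is an assumption) is to build a pixel-to-pixel relation matrix from each feature map and match the student's matrix to the teacher's:

```python
import numpy as np

def pixel_relation(feat):
    """Pairwise pixel relations as a cosine-similarity Gram matrix.
    feat: (C, H, W) feature map -> (H*W, H*W) relation matrix."""
    x = feat.reshape(feat.shape[0], -1).T            # (HW, C) pixel vectors
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    return x @ x.T

def global_distillation_loss(student_feat, teacher_feat):
    """Mean squared difference between student and teacher pixel relations."""
    rs = pixel_relation(student_feat)
    rt = pixel_relation(teacher_feat)
    return float(((rs - rt) ** 2).mean())
```

Because the relation matrix couples every pixel with every other pixel, this term carries global structure that a purely local (focal) distillation misses.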
Ranked #1 on Knowledge Distillation on COCO
Vision-and-Language Navigation (VLN) is a task in which an agent must follow a language instruction to navigate to a goal position, relying on ongoing interactions with the environment while moving.
ByteTrack also achieves state-of-the-art performance on MOT20, HiEve and BDD100K tracking benchmarks.
Ranked #1 on Multiple Object Tracking on BDD100K val
A more realistic object detection paradigm, Open-World Object Detection, has attracted increasing research interest in the community recently.
Large-scale labeled training data is often difficult to collect, especially for person identities.
Video scene parsing is a long-standing challenging task in computer vision, aiming to assign pre-defined semantic labels to pixels of all frames in a given video.
In this paper, we propose a new loss based on center predictivity: a sample must be positioned in the feature space such that the location of the center of same-class samples can be roughly predicted from it.
We extract degradation prior at task-level with the proposed ConditionNet, which will be used to adapt the parameters of the basic SR network (BaseNet).
Motivated by this question, we conduct a series of studies on the performance of self-supervised contrastive learning and supervised learning methods over multiple datasets where training instance distributions vary from a balanced one to a long-tailed one.
Ranked #39 on Long-tail Learning on CIFAR-10-LT (ρ=10)
By disentangling representations on both image and instance levels, DIDN is able to learn domain-invariant representations that are suitable for generalized object detection.
A popular attempt to address this challenge is unpaired generative adversarial networks, which generate "real" LR counterparts from real HR images using image-to-image translation and then perform super-resolution from the "real" LR to SR.
In this work, we propose TransTrack, a simple but efficient scheme to solve the multiple object tracking problem.
Ranked #4 on Multi-Object Tracking on SportsMOT (using extra training data)
In particular, for real-time generation tasks, different devices require generators of different sizes due to varying computing power.
We identify the classification cost in the matching cost as the main ingredient: (1) previous detectors consider only the location cost; (2) by additionally introducing a classification cost, previous detectors immediately produce one-to-one predictions during inference.
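As a hedged illustration of adding a classification term to the matching cost (weights and names below are illustrative, not the paper's exact formulation):

```python
import numpy as np

def matching_cost(pred_boxes, pred_probs, gt_boxes, gt_labels,
                  w_cls=2.0, w_loc=1.0):
    """Pairwise cost between P predictions and G ground truths:
    classification cost (negative probability of the GT class) plus an
    L1 location cost. Shapes: (P,4), (P,C), (G,4), (G,)."""
    cls_cost = -pred_probs[:, gt_labels]                                # (P, G)
    loc_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None]).sum(-1)  # (P, G)
    return w_cls * cls_cost + w_loc * loc_cost

def match(cost):
    """Give each ground truth its cheapest prediction (a greedy sketch;
    detectors typically use Hungarian matching for strict one-to-one)."""
    return cost.argmin(axis=0)
```

With only the location term, a well-placed but misclassified box ties with a correct one; the classification term breaks such ties, which is what pushes the matching toward one-to-one predictions.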
In our method, however, a fixed sparse set of $N$ learned object proposals is provided to the object recognition head to perform classification and localization.
Ranked #5 on 2D Object Detection on CeyMo
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1 and reduce redundancy in representation.
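To illustrate why orthogonality keeps singular values near 1: an orthogonal weight matrix has all singular values exactly 1, which a QR-based initialization (a common generic recipe, not necessarily the paper's method) makes easy to verify:

```python
import numpy as np

def orthogonal_init(rows, cols, seed=0):
    """Orthogonal weight matrix via QR decomposition of a Gaussian matrix
    (rows >= cols assumed). All singular values of the result equal 1,
    so the layer neither inflates nor shrinks signal norms."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q
```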
To perform action detection, we design a 3D convolution network with skip connections for tube classification and regression.
Attention mechanisms have been widely used in Visual Question Answering (VQA) solutions due to their capacity to model deep cross-domain interactions.
Leveraging both visual frames and audio has been experimentally proven effective for improving large-scale video classification.
We pose action localization as a structured prediction over arbitrary-length temporal windows, where each window is scored as the sum of frame-wise classification scores.
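The window formulation above can be sketched as follows: if each frame receives a classification score (positive meaning action-like, negative background-like), the best-scoring window of arbitrary length is a maximum-sum subarray, found in linear time with Kadane's algorithm. The function name is illustrative, not from the paper:

```python
def best_window(frame_scores):
    """Return (start, end, score) of the contiguous window maximizing the
    sum of frame-wise classification scores (Kadane's algorithm).
    `end` is exclusive."""
    best = (0, 0, float("-inf"))
    cur_start, cur_sum = 0, 0.0
    for i, s in enumerate(frame_scores):
        cur_sum += s
        if cur_sum > best[2]:
            best = (cur_start, i + 1, cur_sum)
        if cur_sum < 0:           # a negative prefix can never help; restart
            cur_start, cur_sum = i + 1, 0.0
    return best
```

For example, with scores [-1, 2, 3, -4, 1] the best window covers frames 1-2 with total score 5.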