We formulate the new setup as a dual detection task that first detects integral text units and then groups them into a CTB.
In this paper, we consider inferring GRNs in single cells based on single cell RNA sequencing (scRNA-seq) data.
In two typical cross-domain semantic segmentation tasks, i.e., GTA5 to Cityscapes and SYNTHIA to Cityscapes, our method achieves state-of-the-art segmentation accuracy.
Comparing the distribution differences between HQ and LQ images helps our model better assess image quality.
In this work, we propose the TNS (Time-aware Neighbor Sampling) method: TNS learns from temporal information to provide an adaptive receptive neighborhood for every node at any time.
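The idea of a time-aware, adaptive receptive neighborhood can be illustrated with a minimal sketch. Assumptions are mine: the function name, the inverse temporal-distance weighting, and the list-based neighbor store are all simplifications, not the paper's actual sampler.

```python
import random


def time_aware_sample(neighbors, timestamps, query_time, k, seed=0):
    """Sample k neighbors for a node, biased toward temporally close ones.

    `neighbors` and `timestamps` are parallel lists of past interactions;
    only interactions at or before `query_time` are eligible, and each is
    weighted by the inverse of its temporal distance to the query time.
    """
    eligible = [(n, t) for n, t in zip(neighbors, timestamps) if t <= query_time]
    if not eligible:
        return []
    # Closer-in-time interactions get larger sampling weights.
    weights = [1.0 / (1.0 + (query_time - t)) for _, t in eligible]
    rng = random.Random(seed)
    return rng.choices([n for n, _ in eligible], weights=weights, k=k)
```

Because the weights depend on the query time, the same node gets a different effective neighborhood at different times, which is the core intuition behind time-aware sampling.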
However, it is still in its infancy, with two concerns: 1) changing the graph structure through data augmentation to generate contrastive views may mislead the message passing scheme, as such changes strip away intrinsic graph structural information, especially the directional structure in directed graphs; 2) since GCL usually uses predefined contrastive views with hand-picked parameters, it does not take full advantage of the contrastive information provided by data augmentation, leaving models with incomplete structural information to learn from.
To address this issue, our idea is to transform the temporal graphs using data augmentation (DA) with adaptive magnitudes, so as to effectively augment the input features and preserve the essential semantic information.
A more realistic object detection paradigm, Open-World Object Detection, has attracted increasing research interest in the community recently.
Inspired by the recent success in the Automated Machine Learning (AutoML) literature, in this paper we present Meta Navigator, a framework that attempts to address the aforementioned limitation in few-shot learning by seeking a higher-level strategy and proposes to automate the selection from various few-shot learning designs.
Large-scale labeled training data is often difficult to collect, especially for person identities.
Video scene parsing is a long-standing challenging task in computer vision, aiming to assign pre-defined semantic labels to pixels of all frames in a given video.
To address this, this paper proposes mining the contextual information beyond individual images to further augment the pixel representations.
To tackle this practical problem, we propose a Dual-Learner-based Video Highlight Detection (DL-VHD) framework.
Retrieving occlusion relations among objects in a single image is challenging due to the sparsity of boundaries in images.
The nonlocal-based blocks are designed for capturing long-range spatial-temporal dependencies in computer vision tasks.
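A compact sketch of the standard embedded-Gaussian non-local operation, which these blocks build on, may clarify how long-range dependencies are captured: every position attends to every other position via a softmax over pairwise affinities. The NumPy layout (positions flattened to a matrix) and the projection names are my simplification, not a particular paper's implementation.

```python
import numpy as np


def nonlocal_block(x, w_theta, w_phi, w_g, w_out):
    """Embedded-Gaussian non-local operation on a flattened feature map.

    x: (n, c) array of n spatial-temporal positions with c channels.
    w_theta, w_phi, w_g: (c, d) projection matrices; w_out: (d, c).
    Returns x plus an attention-weighted aggregation over all positions.
    """
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    logits = theta @ phi.T                       # (n, n) pairwise affinities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over all positions
    return x + (attn @ g) @ w_out                # residual connection
```

The (n, n) affinity matrix is what makes the receptive field global: no stacking of local convolutions is needed for two distant positions to interact.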
Given input images, scene graph generation (SGG) aims to produce comprehensive, graphical representations describing visual relationships among salient objects.
Ranked #2 on Unbiased Scene Graph Generation on Visual Genome
Contrastive learning applied to self-supervised representation learning has seen a resurgence in deep models.
In particular, we decouple the training of the representation and the classifier, and systematically investigate the effects of different data re-sampling techniques when training the whole network including a classifier as well as fine-tuning the feature extractor only.
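One of the re-sampling techniques commonly compared in such studies is class-balanced sampling for the classifier-training stage. A minimal sketch, with my own function name and a plain-Python index sampler standing in for a real data loader:

```python
import random
from collections import defaultdict


def class_balanced_indices(labels, n_samples, seed=0):
    """Draw a class-balanced sample of dataset indices.

    Each draw first picks a class uniformly at random, then an instance
    uniformly within that class, so rare classes are seen as often as
    frequent ones. This would apply to the classifier re-training stage;
    the feature extractor is assumed trained with ordinary instance
    sampling beforehand.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    classes = sorted(by_class)
    rng = random.Random(seed)
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(n_samples)]
```

With a 90/10 class imbalance, this sampler still presents both classes roughly equally often, which is exactly the property being ablated when re-sampling strategies are compared.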
The first task focuses on image-to-character (I2C) mapping, which detects a set of character candidates from images based on different alignments of visual features in a non-sequential way.
In this paper, we propose a new loss based on center predictivity; that is, a sample must be positioned at a location in the feature space such that, from it, we can roughly predict the location of the center of same-class samples.
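The center-predictivity idea can be sketched as follows. The linear predictor head, the batch-mean target, and the squared-error form are my assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np


def center_predictive_loss(features, labels, w_pred):
    """Sketch of a center-predictivity loss.

    For every sample, a linear head (w_pred) predicts the location of its
    class center from the sample's own feature; the target is the batch
    mean of same-class features. Returns the mean squared error.
    """
    loss, count = 0.0, 0
    for c in np.unique(labels):
        z = features[labels == c]      # (m, d) same-class features
        center = z.mean(axis=0)        # empirical class center
        pred = z @ w_pred              # one predicted center per sample
        loss += ((pred - center) ** 2).sum()
        count += z.shape[0]
    return loss / count
```

The loss is zero exactly when every sample's prediction lands on its class center, so minimizing it pushes samples toward positions from which the center is recoverable.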
It is also worth pointing out that, given identical strong data augmentations, the performance improvement of ConTNet is more remarkable than that of ResNet.
We extract a degradation prior at the task level with the proposed ConditionNet, which is then used to adapt the parameters of the basic SR network (BaseNet).
In this paper, we propose MINE to perform novel view synthesis and depth estimation via dense 3D reconstruction from a single image.
Fine-grained visual classification (FGVC) which aims at recognizing objects from subcategories is a very challenging task due to the inherently subtle inter-class differences.
Ranked #3 on Fine-Grained Image Classification on CUB-200-2011
Convolution has been the core ingredient of modern neural networks, triggering the surge of deep learning in vision.
Ranked #591 on Image Classification on ImageNet
In this paper, a retrieval-based coarse-to-fine framework is proposed, where we re-rank the TopN classification results by using the local region enhanced embedding features to improve the Top1 accuracy (based on the observation that the correct category usually resides in TopN results).
Ranked #14 on Fine-Grained Image Classification on CUB-200-2011
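The retrieval-based re-ranking step described above can be sketched in a few lines. The cosine-similarity scoring and the per-class prototype embeddings are stand-ins of my own choosing for the paper's local-region-enhanced retrieval features:

```python
import numpy as np


def rerank_topn(class_scores, query_emb, class_prototypes, n=5):
    """Re-rank the TopN classifier outputs by embedding similarity.

    class_scores: (num_classes,) scores from the base classifier.
    query_emb: (d,) retrieval embedding of the query image.
    class_prototypes: (num_classes, d) one retrieval embedding per class.
    Returns the TopN candidate class indices reordered by cosine
    similarity, so a correct class buried in TopN can rise to Top1.
    """
    topn = np.argsort(class_scores)[::-1][:n]           # TopN candidates
    q = query_emb / np.linalg.norm(query_emb)
    protos = class_prototypes[topn]
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sims = protos @ q                                    # cosine similarities
    return topn[np.argsort(sims)[::-1]]
```

This mirrors the coarse-to-fine observation: the classifier only needs to place the correct category somewhere in TopN, and the retrieval features settle the final ordering.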
Most typical click models assume that the probability of a document to be examined by users only depends on position, such as PBM and UBM.
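The PBM assumption mentioned above factorizes the click probability into a position-only examination term and a document relevance term; a minimal sketch (the function name and list-based examination table are mine):

```python
def pbm_click_prob(position, relevance, exam_probs):
    """Position-Based Model: P(click) = P(examined | position) * P(relevant).

    `exam_probs[position]` depends only on the rank position, not on the
    other documents shown, which is exactly the PBM assumption.
    """
    return exam_probs[position] * relevance
```

UBM generalizes this by letting the examination probability also depend on the rank of the previously clicked document, but still not on document content.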
When choosing the Chebyshev graph filter, a generalized formulation can be derived that explains the existing nonlocal-based blocks (e.g., the nonlocal block, nonlocal stage, and double attention block) and can be used to analyze their shortcomings.
By disentangling representations on both image and instance levels, DIDN is able to learn domain-invariant representations that are suitable for generalized object detection.
A popular attempt to address this challenge is unpaired generative adversarial networks, which generate "real" LR counterparts from real HR images using image-to-image translation and then perform super-resolution from the "real" LR to SR.
In this work, we propose TransTrack, a simple but efficient scheme to solve the multiple object tracking problem.
Ranked #8 on Multi-Object Tracking on DanceTrack
We identify the classification term in the matching cost as the main ingredient: (1) previous detectors consider only location cost; (2) by additionally introducing classification cost, previous detectors immediately produce one-to-one predictions during inference.
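A matching cost combining the two terms can be sketched as below. The L1 box distance standing in for the location cost, the weights, and the function name are my simplifications for illustration:

```python
import numpy as np


def best_match(cls_prob, pred_boxes, gt_class, gt_box, w_cls=1.0, w_loc=1.0):
    """One-to-one matching cost mixing classification and location terms.

    cls_prob: (n, num_classes) predicted class probabilities.
    pred_boxes: (n, 4) and gt_box: (4,) in (x1, y1, x2, y2) coordinates;
    L1 distance serves as a simple stand-in for the location cost.
    Returns the index of the single best-matching prediction.
    """
    cost_cls = -cls_prob[:, gt_class]                   # high prob -> low cost
    cost_loc = np.abs(pred_boxes - gt_box).sum(axis=1)  # L1 box distance
    return int(np.argmin(w_cls * cost_cls + w_loc * cost_loc))
```

With only the location term, two predictions on the same box tie and both survive; adding the classification term breaks the tie, which is why one-to-one prediction emerges.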
In particular, for real-time generation tasks, different devices require generators of different sizes due to varying computing power.
Specifically, our proposed network consists of three main parts: Siamese Encoder Module, Center Guiding Appearance Diffusion Module, and Dynamic Information Fusion Module.
Ranked #5 on Unsupervised Video Object Segmentation on FBMS test
In this paper, we study what would happen when normalization layers are removed from the network, and show how to train deep neural networks without normalization layers and without performance degradation.
In our method, however, a fixed sparse set of learned object proposals, $N$ in total, is provided to the object recognition head to perform classification and localization.
Ranked #87 on Object Detection on COCO minival
Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions.
Semi-supervised video object segmentation is an interesting yet challenging task in machine learning.
To perform action detection, we design a 3D convolution network with skip connections for tube classification and regression.
In this paper, a novel Context-and-Spatial Aware Network (CSANet), which integrates both a Context Aware Path and Spatial Aware Path, is proposed to obtain effective features involving both context information and spatial information.
Most previous models try to learn a fixed one-directional mapping between visual and semantic space, while some recently proposed generative methods try to generate image features for unseen classes so that the zero-shot learning problem becomes a traditional fully-supervised classification problem.
In this work, we propose a mask propagation network that treats the video segmentation problem as guided instance segmentation.
Attention mechanisms have been widely used in Visual Question Answering (VQA) solutions due to their capacity to model deep cross-domain interactions.
Leveraging both visual frames and audio has been experimentally proven effective to improve large-scale video classification.
There has been a drastic growth of research in Generative Adversarial Nets (GANs) in the past few years.
In this way, the sequential representation of an image can be naturally translated to a sequence of words, as the target sequence of the RNN model.
Different from existing work, where basic morphing types at the layer level were addressed, we target the central problem of network morphism at a higher level, i.e., how a convolutional layer can be morphed into an arbitrary module of a neural network.
In this paper, we develop a Single frame Video Parsing (SVP) method which requires only one labeled frame per video in the training stage.
A hierarchical shape parsing strategy is proposed to partition and organize image components into a hierarchical structure in the scale space.