Humans can envision a realistic photo given a free-hand sketch that is not only spatially imprecise and geometrically distorted but also devoid of colors and visual details.
We enforce spatial consistency of grouping and bootstrap feature learning with co-segmentation among multiple views of the same image, and enforce semantic consistency across the grouping hierarchy with clustering transformers between coarse- and fine-grained features.
Our key insight is that pseudo-labels are naturally imbalanced due to intrinsic data similarity, even when a model is trained on balanced source data and evaluated on balanced target data.
Ranked #1 on Semi-Supervised Image Classification on ImageNet - 0.2% labeled data (using extra training data)
Our model starts with two separate pathways: an appearance pathway that outputs feature-based region segmentation for a single image, and a motion pathway that outputs motion features for a pair of images.
Ranked #14 on Video Polyp Segmentation on SUN-SEG-Easy
Existing SSL methods focus on learning a model that effectively integrates information from given small labeled data and large unlabeled data, whereas we focus on selecting the right data to annotate for SSL without requiring any label or task information.
Specifically, for a network, we create a recurrent parameter generator (RPG), from which the parameters of each convolution layer are generated.
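As an illustrative sketch only (the abstract does not give implementation details), one simple realization of generating every layer's weights from a single shared parameter bank, with ring-style wraparound reuse, could look like this; the bank size, slicing scheme, and layer shapes here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Shared parameter bank (assumption: one flat vector reused by all layers)
bank = rng.standard_normal(4096)

def layer_params(offset, shape):
    """Slice one layer's parameters from the shared bank,
    wrapping around (ring access) once the bank is exhausted."""
    n = int(np.prod(shape))
    idx = (offset + np.arange(n)) % bank.size
    return bank[idx].reshape(shape), offset + n

w1, off = layer_params(0, (16, 3, 3, 3))     # conv1: 16 filters of 3x3x3
w2, off = layer_params(off, (32, 16, 3, 3))  # conv2 draws from the same bank
print(w1.shape, w2.shape)
```

Because the second layer needs more parameters than the bank holds, its weights necessarily reuse entries already assigned to the first layer, which is the parameter-sharing effect such a generator exploits.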
Recent progress in network-based audio event classification has shown the benefit of pre-training models on visual data such as ImageNet.
Camera trapping is increasingly used to monitor wildlife, but this technology typically requires extensive data annotation.
Weakly supervised segmentation requires assigning a label to every pixel based on training instances with partial annotations such as image-level tags, object bounding boxes, labeled points and scribbles.
Existing methods focus on training an RL policy that is universal to changing visual domains, whereas we focus on extracting visual foreground that is universal, feeding clean invariant vision to the RL policy learner.
Deep learning (DL) based unrolled reconstructions have shown state-of-the-art performance for under-sampled magnetic resonance imaging (MRI).
We take a dynamic view of the training data and provide a principled model bias and variance analysis as the training data fluctuates: Existing long-tail classifiers invariably increase the model variance, and the head-tail model bias gap remains large, due to more frequent and more severe confusion with hard negatives for the tail classes.
Ranked #4 on Long-tail Learning on CIFAR-100-LT (ρ=100)
The concept of TBC can also be extended to group convolution and fully connected layers, and can be applied to various backbone networks and attention modules.
Unsupervised feature learning has made great strides with contrastive learning based on instance discrimination and invariant mapping, as benchmarked on curated class-balanced datasets.
Additionally, we propose a sketch standardization module to handle different sketch distortions and styles.
We develop an efficient approach to impose filter orthogonality on a convolutional layer based on the doubly block-Toeplitz matrix representation of the convolutional kernel instead of using the common kernel orthogonality approach, which we show is only necessary but not sufficient for ensuring orthogonal convolutions.
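For context, a minimal sketch of the common kernel-orthogonality penalty that this approach improves upon: flatten each filter into a row of a matrix K and penalize the deviation of K Kᵀ from the identity. This is only the baseline regularizer the abstract contrasts against, not the doubly block-Toeplitz construction itself; the shapes below are illustrative assumptions.

```python
import numpy as np

def kernel_orth_penalty(W):
    """Kernel-orthogonality penalty ||K K^T - I||_F, where K flattens
    each filter of W (out_ch, in_ch, k, k) into one row vector."""
    K = W.reshape(W.shape[0], -1)
    return np.linalg.norm(K @ K.T - np.eye(K.shape[0]))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4, 3, 3)) / 6.0  # random kernels: nonzero penalty
print(kernel_orth_penalty(W))
```

A zero penalty here only means the flattened filters are orthonormal; as the abstract notes, this is necessary but not sufficient for the full convolution (viewed as a doubly block-Toeplitz matrix) to be orthogonal.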
The proposed SegSort further produces an interpretable result, as each choice of label can be easily understood from the retrieved nearest segments.
Ranked #3 on Unsupervised Semantic Segmentation on PASCAL VOC 2012 val (using extra training data)
In the first case, a machine learning-assisted framework, BRAILS, is proposed for city-scale building information modeling.
A typical domain adaptation approach is to adapt models trained on the annotated data in a source domain (e.g., sunny weather) for achieving high performance on the test data in a target domain (e.g., rainy weather).
We propose a novel end-to-end approach to learn different non-rigid transformations of the input point cloud so that optimal local neighborhoods can be adopted at each layer.
On RadioML, our model achieves comparable RF modulation classification accuracy at 10% of the baseline model size.
We define Open Long-Tailed Recognition (OLTR) as learning from such naturally distributed data and optimizing the classification accuracy over a balanced test set that includes head, tail, and open classes.
Current major approaches to visual recognition follow an end-to-end formulation that classifies an input image into one of a pre-determined set of semantic categories.
Neural net classifiers trained on data with annotated class labels can also capture apparent visual similarity among categories without being directed to do so.
Ranked #37 on Semi-Supervised Image Classification on ImageNet - 1% labeled data (Top 5 Accuracy metric)
The structure analyzer is trained to maximize the ASM loss, or to emphasize recurring multi-scale hard negative structural mistakes among co-occurring patterns.
In the dental industry, it takes a technician years of training to design synthetic crowns that restore the function and integrity of missing teeth.
Semantic segmentation has made much progress with increasingly powerful pixel-wise classifiers and incorporating structural priors via Conditional Random Fields (CRF) or Generative Adversarial Networks (GAN).
Ranked #48 on Semantic Segmentation on Cityscapes test
To address the overfitting problem in aerial image classification, we view the neural network as successive transformations of an input image into embedded feature representations and ultimately into a semantic class label. We then train neural networks to optimize image representations in the embedded space in addition to optimizing the final classification score.
Rendered with realistic environment maps, millions of synthetic images of objects and their corresponding albedo, shading, and specular ground-truth images are used to train an encoder-decoder CNN.
Most critically, multigrid structure enables networks to learn internal attention and dynamic routing mechanisms, and use them to accomplish tasks on which modern CNNs fail.
In this work, we show that we can detect important objects in first-person images without supervision from the camera wearer or even from third-person labelers.
Finally, we use this feature to learn a basketball assessment model from pairs of labeled first-person basketball videos, for which a basketball expert indicates which of the two players is better.
It combines these two objectives via a novel random walk layer that enforces consistent spatial grouping in the deep layers of the network.
We address the difficult problem of distinguishing fine-grained object categories in low resolution images.
Unlike traditional third-person cameras mounted on robots, a first-person camera captures a person's visual sensorimotor object interactions from up close.
Spectral embedding provides a framework for solving perceptual organization problems, including image segmentation and figure/ground organization.
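The core of the spectral approach can be illustrated on a toy graph: embed nodes via the leading eigenvectors of the normalized affinity matrix and threshold the second eigenvector to split the graph into two groups (the normalized-cut relaxation). The affinity matrix below is an assumed toy example, not data from the work itself.

```python
import numpy as np

# Toy affinity: two 3-node cliques joined by one weak edge (weight 0.05)
W = np.array([
    [0, 1, 1, 0.05, 0, 0],
    [1, 0, 1, 0,    0, 0],
    [1, 1, 0, 0,    0, 0],
    [0.05, 0, 0, 0, 1, 1],
    [0, 0, 0, 1,    0, 1],
    [0, 0, 0, 1,    1, 0],
], dtype=float)

d = W.sum(axis=1)
Dm = np.diag(d ** -0.5)
# Second-largest eigenvector of D^{-1/2} W D^{-1/2} gives the ncut relaxation
vals, vecs = np.linalg.eigh(Dm @ W @ Dm)
fiedler = Dm @ vecs[:, -2]            # back to the generalized eigenvector
labels = (fiedler > 0).astype(int)    # threshold at zero -> two segments
print(labels)
```

Thresholding the embedding recovers the two cliques as separate groups, which is the same mechanism that separates image regions or figure from ground when W encodes pixel affinities.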
We demonstrate results on both the synthetic images of Sintel and the real images of the classic MIT intrinsic image dataset.
We present a regression framework which models the output distribution of neural networks.
We consider the visual sentiment task of mapping an image to an adjective noun pair (ANP) such as "cute baby".
Given a set of poorly aligned images of the same visual concept without any annotations, we propose an algorithm to jointly bring them into pixel-wise correspondence by estimating a FlowWeb representation of the image set.
We frame the task of predicting a semantic labeling as a sparse reconstruction procedure that applies a target-specific learned transfer function to a generic deep sparse code representation of an image.
Size, color, and orientation have long been considered elementary features whose attributes are extracted in parallel and available to guide the deployment of attention.