In the unsupervised video segmentation mode, the network is trained on a collection of unlabelled videos, using the learning process itself as an algorithm to segment these videos.
We find that these eigenvectors already decompose an image into meaningful segments, and can be readily used to localize objects in a scene.
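As a rough illustration of the idea above, the eigenvectors of a graph Laplacian built from pixel affinities can already split a simple image into segments. The color-based affinity, the kernel width, and the thresholding of the Fiedler vector below are illustrative assumptions for this sketch, not the method from the paper.

```python
# Minimal sketch: threshold the Fiedler vector of a pixel-affinity graph.
import numpy as np
from scipy.sparse.csgraph import laplacian

def spectral_segments(image, sigma=0.1):
    """Two-way segmentation of a small (H, W, 3) float image in [0, 1]."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3)
    # Dense affinity from pairwise color distances (fine for small images).
    d2 = ((pixels[:, None, :] - pixels[None, :, :]) ** 2).sum(-1)
    affinity = np.exp(-d2 / (2 * sigma ** 2))
    # Second-smallest eigenvector of the normalized Laplacian (Fiedler vector)
    # gives a soft two-way partition of the pixel graph.
    lap = laplacian(affinity, normed=True)
    _, eigvecs = np.linalg.eigh(lap)
    fiedler = eigvecs[:, 1]
    return (fiedler > fiedler.mean()).reshape(h, w)

# Tiny synthetic example: a bright square on a dark background.
rng = np.random.default_rng(0)
img = np.full((16, 16, 3), 0.1)
img[4:12, 4:12] = 0.9
img = np.clip(img + 0.02 * rng.standard_normal(img.shape), 0, 1)
mask = spectral_segments(img)
print(mask.sum(), "pixels on one side of the split")
```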
By combining special sampling strategies with a subsequent numerical optimization step in post-processing, thermographic super-resolution has already proven superior to standard thermographic methods for detecting one-dimensional defect/inhomogeneity structures.
With increasing focus on augmented and virtual reality applications (XR) comes the demand for algorithms that can lift objects from images and videos into representations that are suitable for a wide variety of related 3D tasks.
We benchmark a large set of recent unsupervised multi-object segmentation models on ClevrTex and find all state-of-the-art approaches fail to learn good representations in the textured setting, despite impressive performance on simpler data.
First, we construct a proxy task through a set of objectives that encourages the model to learn a meaningful decomposition of the image into its parts.
We then train a fine-grained textual similarity model that matches image descriptions with documents on a sentence-level basis.
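As a toy illustration of sentence-level matching, a description can be scored against a document by comparing every pair of sentences and aggregating the best matches. The random embeddings and the max-then-mean aggregation below are assumptions for this sketch; a real system would use a trained text encoder and a learned objective.

```python
# Toy sentence-level similarity between an image description and a document.
import numpy as np

def sentence_level_score(desc_embs, doc_embs):
    """desc_embs: (S, D) description sentence embeddings;
    doc_embs: (T, D) document sentence embeddings.
    Returns the mean, over description sentences, of the best cosine match."""
    a = desc_embs / np.linalg.norm(desc_embs, axis=1, keepdims=True)
    b = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = a @ b.T                  # (S, T) pairwise cosine similarities
    return sims.max(axis=1).mean()  # best document sentence per description sentence

desc = np.random.randn(3, 64)   # 3 sentences describing the image (placeholder embeddings)
doc = np.random.randn(20, 64)   # 20 sentences in the candidate document
print(sentence_level_score(desc, doc))
```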
Learning strong representations for multi-modal retrieval is an important problem for many applications, such as recommendation and search.
On the other hand, state-of-the-art pretraining is nowadays obtained with unsupervised methods, meaning that labelled datasets such as ImageNet may not be necessary, or perhaps not even optimal, for model pretraining.
In this paper, we present DOVE, a method that learns textured 3D models of deformable object categories from monocular videos available online, without keypoint, viewpoint or template shape supervision.
Recent research has shown that numerous human-interpretable directions exist in the latent space of GANs.
Is critical input information encoded in specific sparse pathways within the neural network?
A large part of the current success of deep learning lies in the effectiveness of data -- more precisely: labelled data.
In our work, we address the novel problem of image manipulation from scene graphs, in which a user can edit images by merely applying changes in the nodes or edges of a semantic graph that is generated from the image.
Attributing the output of a neural network to the contribution of given input elements is a way of shedding light on the black-box nature of neural networks.
We propose a method to learn 3D deformable object categories from raw single-view images, without external supervision.
Combining clustering and representation learning is one of the most promising approaches for unsupervised learning of deep neural networks.
The goal of the IARAI traffic4cast competition was to predict city-wide traffic status within a 15-minute time window, based on information from the previous hour.
Specifically, given a single image of the object seen from an arbitrary viewpoint, our model predicts a symmetric canonical view, the corresponding 3D shape and a viewpoint transformation, and trains with the goal of reconstructing the input view, resembling an auto-encoder.
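A heavily simplified sketch of that analysis-by-synthesis setup is given below: a small network factors the input into depth, albedo, and lighting, re-renders the image with crude Lambertian shading, and is trained on the reconstruction error. The viewpoint transformation, symmetry constraint, and confidence maps of the actual method are omitted, and the network sizes and shading model are illustrative assumptions.

```python
# Simplified reconstruction-as-supervision sketch (not the full method).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFactorNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.depth_head = nn.Conv2d(16, 1, 3, padding=1)   # per-pixel depth
        self.albedo_head = nn.Conv2d(16, 3, 3, padding=1)  # per-pixel albedo
        self.light_head = nn.Linear(16, 4)                  # ambient, diffuse, light dir (x, y)

    def forward(self, img):
        feat = self.backbone(img)
        depth = self.depth_head(feat)
        albedo = torch.sigmoid(self.albedo_head(feat))
        light = self.light_head(feat.mean(dim=(2, 3)))
        return depth, albedo, light

def shade(depth, albedo, light):
    """Lambertian shading from depth-derived normals: a crude stand-in
    for the differentiable renderer used in the real method."""
    dzdx = F.pad(depth[:, :, :, 1:] - depth[:, :, :, :-1], (0, 1, 0, 0))
    dzdy = F.pad(depth[:, :, 1:, :] - depth[:, :, :-1, :], (0, 0, 0, 1))
    normal = F.normalize(torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1), dim=1)
    ambient, diffuse = light[:, :1], light[:, 1:2]
    ldir = F.normalize(torch.cat([light[:, 2:4], torch.ones_like(light[:, :1])], dim=1), dim=1)
    ndotl = (normal * ldir[:, :, None, None]).sum(dim=1, keepdim=True).clamp(min=0)
    shading = ambient[:, :, None, None] + diffuse[:, :, None, None] * ndotl
    return albedo * shading

# One illustrative training step on a random batch.
net = TinyFactorNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
img = torch.rand(2, 3, 32, 32)
depth, albedo, light = net(img)
loss = F.l1_loss(shade(depth, albedo, light), img)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```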
We look critically at popular self-supervision techniques for learning deep convolutional neural networks without manual labels.
Further, critical states in which a very high or a very low reward can be achieved are often interesting for understanding the situational awareness of the system, as they can correspond to risky states.
For each object instance we predict multiple pose and class outcomes to estimate the specific pose distribution generated by symmetries and repetitive textures.
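One common way to train such multi-hypothesis outputs, sketched below, is a winner-take-all loss in which only the predicted pose closest to the ground truth receives gradient, so that symmetric or ambiguous objects can spread probability mass over several valid poses. The pose parameterization and distance used here are simplifying assumptions, and the class outcomes are not modelled.

```python
# Winner-take-all training of multiple pose hypotheses (illustrative only).
import torch

def winner_take_all_loss(pred_poses, gt_pose):
    """pred_poses: (B, M, D) candidate pose vectors; gt_pose: (B, D)."""
    dists = ((pred_poses - gt_pose[:, None, :]) ** 2).sum(dim=-1)  # (B, M)
    best = dists.min(dim=1).values  # penalize only the closest hypothesis
    return best.mean()

pred = torch.randn(4, 5, 7, requires_grad=True)  # 5 hypotheses of a 7-D pose each
gt = torch.randn(4, 7)
loss = winner_take_all_loss(pred, gt)
loss.backward()
print(float(loss))
```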
Further, we reformulate the problem of robotic grasping by replacing conventional grasp rectangles with grasp belief maps, which hold more precise location information than a rectangle and account for the uncertainty inherent to the task.
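As a small illustration, a grasp rectangle (center, angle, width, height) can be converted into a belief map by placing an oriented Gaussian over the grasp center, giving a per-pixel likelihood rather than a single box. The exact parameterization below is an illustrative assumption, not necessarily the paper's definition.

```python
# Oriented Gaussian belief map for one grasp rectangle.
import numpy as np

def grasp_belief_map(shape, center, angle, width, height):
    """shape: (H, W) of the output map; center: (x, y) in pixels;
    angle: rectangle orientation in radians; width/height: rectangle size,
    used here as spreads along the rotated axes."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - center[0], ys - center[1]
    # Rotate pixel offsets into the rectangle's frame.
    u = np.cos(angle) * dx + np.sin(angle) * dy
    v = -np.sin(angle) * dx + np.cos(angle) * dy
    belief = np.exp(-0.5 * ((u / (0.5 * width)) ** 2 + (v / (0.5 * height)) ** 2))
    return belief / belief.max()

bmap = grasp_belief_map((64, 64), center=(32, 20), angle=np.deg2rad(30), width=24, height=8)
print(bmap.shape, bmap.max())  # (64, 64) 1.0
```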
Variational methods for revealing visual concepts learned by convolutional neural networks have gained significant attention in recent years.
Interaction and collaboration between humans and intelligent machines have become increasingly important as machine learning methods move into real-world applications that involve end users.
We introduce a new approach to estimate continuous actions using actor-critic algorithms for reinforcement learning problems.
Ideally, we would combine the ability of machine learning to leverage big data for learning the semantics of a task with the ability of task-planning techniques to reliably generalize to new environments.
One of the main problems in webly-supervised learning is cleaning the noisy labeled data from the web.
Real-time instrument tracking is a crucial requirement for various computer-assisted interventions.
Recurrent neural networks (RNNs) have achieved state-of-the-art performance on many diverse tasks, from machine translation to surgical activity recognition, yet training RNNs to capture long-term dependencies remains difficult.
In future prediction, for example, many distinct outcomes are equally valid.
We propose a novel hands-free method to interactively segment 3D medical volumes.
We propose a method for interactive boundary extraction which combines a deep, patch-based representation with an active contour framework.
Over the last decade, Convolutional Neural Networks (CNNs) have seen a tremendous surge in performance.
This paper addresses the problem of estimating the depth map of a scene given a single RGB image.
This paper presents a method for 3D segmentation of kidneys from patients with autosomal dominant polycystic kidney disease (ADPKD) and severe renal insufficiency, using computed tomography (CT) data.
Convolutional Neural Networks (ConvNets) have successfully contributed to improving the accuracy of regression-based methods for computer vision tasks such as human pose estimation, landmark localization, and object detection.