Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images.
Though massive amounts of synthetic RGB images are easy to obtain, the models trained on them suffer from noticeable performance degradation due to the synthetic-to-real domain gap.
Various datasets have been proposed for simultaneous localization and mapping (SLAM) and related problems.
Understanding 3D environments semantically is pivotal in autonomous driving applications where multiple computer vision tasks are involved.
In this paper, we present a comprehensive study on the utility of deep convolutional neural networks with two state-of-the-art pooling layers, which are placed after the convolutional layers and fine-tuned in an end-to-end manner, for the visual place recognition task under challenging conditions, including seasonal and illumination variations.
This contrasts with the case of synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space.
We present HRF-Net, a novel view synthesis method based on holistic radiance fields that renders novel views from a sparse set of inputs.
The pose estimation is decomposed into three sub-tasks: a) object 3D rotation representation learning and matching; b) estimation of the 2D location of the object center; and c) scale-invariant distance estimation (the translation along the z-axis) via classification.
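As a rough sketch of how such a decomposition can be laid out (all module names, dimensions, and bin counts below are illustrative assumptions, not the paper's architecture), the three sub-tasks map naturally onto separate prediction heads over a shared feature:

```python
import torch.nn as nn

class PoseHeads(nn.Module):
    """Illustrative three-head layout mirroring the decomposition above:
    (a) a rotation embedding used for representation matching,
    (b) the 2D location of the object center, and
    (c) the z-translation predicted as classification over distance bins."""
    def __init__(self, feat_dim=256, rot_dim=64, depth_bins=100):
        super().__init__()
        self.rotation = nn.Linear(feat_dim, rot_dim)   # (a) rotation representation
        self.center = nn.Linear(feat_dim, 2)           # (b) 2D object center
        self.z_bins = nn.Linear(feat_dim, depth_bins)  # (c) distance as classification

    def forward(self, feats):
        return self.rotation(feats), self.center(feats), self.z_bins(feats)
```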
Real-time holistic scene understanding would allow machines to interpret their surrounding in a much more detailed manner than is currently possible.
First, it is rank-insensitive: It ignores the rank positions of successfully localised moments in the top-$K$ ranked list by treating the list as a set.
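A minimal sketch of this rank-insensitivity (the metric, identifiers, and data below are illustrative, not taken from the benchmark):

```python
def recall_at_k(ranked, relevant, k):
    # Treats the top-K list as a set: a hit at rank 1 and a hit at rank K
    # contribute identically to the score.
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

ranked_a = [0, 5, 3, 7, 9, 2, 8, 1, 4, 6]  # true moment (id 0) at rank 1
ranked_b = [5, 3, 7, 9, 2, 8, 1, 4, 6, 0]  # true moment (id 0) at rank 10
assert recall_at_k(ranked_a, [0], k=10) == recall_at_k(ranked_b, [0], k=10)
```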
OC-cost computes the cost of correcting detections to ground truths as a measure of accuracy.
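As a toy illustration of the correction-cost idea (a deliberate simplification; the actual OC-cost formulation matches and weights detections differently, e.g., it also accounts for unmatched detections and ground truths), the measure can be read as the total cost of an optimal assignment between detections and ground truths:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def correction_cost(pairwise_cost):
    """pairwise_cost[i, j]: cost of correcting detection i into ground
    truth j (e.g., a mix of localization and classification error)."""
    rows, cols = linear_sum_assignment(pairwise_cost)
    return pairwise_cost[rows, cols].sum()

cost = np.array([[0.1, 0.9],
                 [0.8, 0.2]])
print(correction_cost(cost))  # 0.3: detection 0 -> GT 0, detection 1 -> GT 1
```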
We present LDP, a lightweight dense prediction neural architecture search (NAS) framework.
This paper proposes a universal framework, called OVE6D, for model-based 6D object pose estimation from a single depth image and a target object mask.
In this work, we propose an end-to-end learned video codec that introduces several architectural novelties as well as training novelties, revolving around the concepts of adaptation and attention.
In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos, in less time than it takes to play the generated sound, on a single GPU.
The objective of this paper is to perform visual sound separation: i) we study visual sound separation on spectrograms of different temporal resolutions; ii) we propose a new lightweight yet efficient three-stream framework, V-SlowFast, that operates on the Visual frame, the Slow spectrogram, and the Fast spectrogram.
This paper presents a novel neural architecture search method, called LiDNAS, for generating lightweight monocular depth estimation models.
In addition, we introduce a normalized Hessian loss term invariant to scaling and shear along the depth direction, which is shown to substantially improve the accuracy.
Over recent years, deep learning-based computer vision systems have been applied to images at an ever-increasing pace, oftentimes representing the only type of consumption for those images.
One possible solution approach consists of adapting current human-targeted image and video coding standards to the use case of machine consumption.
We present HybVIO, a novel hybrid approach for combining filtering-based visual-inertial odometry (VIO) with optimization-based SLAM.
More specifically, models using the FlipReID structure are trained on the original images and the flipped images simultaneously, and the incorporated flipping loss minimizes the mean squared error between the feature vectors of corresponding image pairs.
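A minimal sketch of such a flipping loss, assuming a PyTorch model that returns one feature vector per image (this is not the official FlipReID code):

```python
import torch
import torch.nn.functional as F

def flipping_loss(model, images):
    # Encode the originals and their horizontal flips, then penalize the
    # mean squared error between the corresponding feature vectors.
    feats = model(images)                             # (B, D) embeddings
    feats_flip = model(torch.flip(images, dims=[3]))  # flip along the width axis
    return F.mse_loss(feats, feats_flip)
```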
Supervised object detection has been proven successful on many benchmark datasets, achieving human-level performance.
The objective of this paper is to perform audio-visual sound source separation, i.e., to separate component audios from a mixture based on the videos of sound sources.
Image reenactment is a task where the target object in the source image imitates the motion represented in the driving image.
In this paper, we propose enhancing monocular depth estimation by adding 3D points as depth guidance.
We propose a new cascaded architecture for novel view synthesis, called RGBD-Net, which consists of two core components: a hierarchical depth regression network and a depth-aware generator network.
In this paper, we present a series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task.
In a second phase, the Model-Agnostic Meta-Learning (MAML) approach is adapted to the specific case of image compression, where the inner loop performs latent tensor overfitting and the outer loop updates both the encoder and decoder neural networks based on the overfitting performance.
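A heavily simplified sketch of this two-loop scheme (using a distortion-only stand-in for the rate-distortion loss and plain gradient steps on the latent; all modules and hyperparameters here are assumptions, not the paper's setup):

```python
import torch
import torch.nn.functional as F

def meta_step(encoder, decoder, image, meta_opt, inner_steps=5, inner_lr=1e-2):
    # Inner loop: overfit the latent tensor to this image, networks frozen.
    latent = encoder(image).detach().requires_grad_(True)
    for _ in range(inner_steps):
        loss = F.mse_loss(decoder(latent), image)   # distortion-only stand-in
        (grad,) = torch.autograd.grad(loss, latent)
        latent = (latent - inner_lr * grad).detach().requires_grad_(True)

    # Outer loop: update encoder and decoder based on the overfitting result,
    # pulling the encoder's prediction toward the overfitted latent.
    meta_opt.zero_grad()
    outer = F.mse_loss(decoder(latent), image) + F.mse_loss(encoder(image), latent.detach())
    outer.backward()
    meta_opt.step()
```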
A key element in COF is a novel opponent filter module that identifies and relocates residual components between sources.
We show the effectiveness of the proposed model with audio and visual modalities on the dense video captioning task, yet the module is capable of digesting any two modalities in a sequence-to-sequence task.
One of the core components of conventional (i.e., non-learned) video codecs consists of predicting a frame from a previously-decoded frame, by leveraging temporal correlations.
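A simplified sketch of such temporal prediction, using dense-flow warping as a stand-in for the classical block-based motion compensation found in standard codecs (not any specific codec's implementation):

```python
import torch
import torch.nn.functional as F

def predict_from_previous(prev_frame, flow):
    """Warp the previously decoded frame (B, C, H, W) with a dense motion
    field (B, 2, H, W) to predict the current frame."""
    b, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().expand(b, h, w, 2)
    grid = base + flow.permute(0, 2, 3, 1)      # displaced sampling positions
    gx = 2 * grid[..., 0] / (w - 1) - 1         # normalize x to [-1, 1]
    gy = 2 * grid[..., 1] / (h - 1) - 1         # normalize y to [-1, 1]
    return F.grid_sample(prev_frame, torch.stack((gx, gy), dim=-1),
                         align_corners=True)
```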
This paper addresses the problem of novel view synthesis by means of neural rendering, where we are interested in predicting the novel view at an arbitrary camera pose based on a given set of input images from other viewpoints.
Recovering the scene depth from a single image is an ill-posed problem that requires additional priors, often referred to as monocular depth cues, to disambiguate different 3D interpretations.
We apply an automatic speech recognition (ASR) system to obtain a temporally aligned textual description of the speech (similar to subtitles) and treat it as a separate input alongside the video frames and the corresponding audio track.
Our results suggest that (1) audio is a strong contributing cue for saliency prediction, (2) a salient visible sound source is the natural cause of the superiority of our Audio-Visual model, (3) richer feature representations for the input space lead to more powerful predictions even in the absence of more sophisticated saliency decoders, and (4) the Audio-Visual model improves on 53.54% of the frames predicted by the best Visual model (our baseline).
Extensive experiments over multiple datasets reveal that (1) spatial biases are strong in egocentric videos, (2) bottom-up saliency models perform poorly in predicting gaze and underperform the spatial biases, (3) deep features perform better than traditional features, (4) as opposed to hand regions, the manipulation point is a strongly influential cue for gaze prediction, (5) combining the proposed recurrent model with bottom-up cues, vanishing points, and, in particular, the manipulation point yields the best gaze prediction accuracy on egocentric videos, (6) knowledge transfer works best when the tasks or sequences are similar, and (7) task and activity recognition can benefit from gaze prediction.
Knee osteoarthritis (OA) is the most common musculoskeletal disease without a cure, and current treatment options are limited to symptomatic relief.
Predicting a novel view of a scene from an arbitrary number of observations is a challenging problem for computers as well as for humans.
This paper presents a generic face animator that is able to control the pose and expressions of a given face image.
Video summarization is a technique to create a short skim of the original video while preserving the main stories/content.
This paper addresses the challenge of dense pixel correspondence estimation between two images.
The lack of realistic and open benchmarking datasets for pedestrian visual-inertial odometry has made it hard to pinpoint differences in published methods.
In this paper, we propose a new general purpose image-to-image translation model that is able to utilize both paired and unpaired training data simultaneously.
In this work, we propose a neural-network-based image descriptor suitable for image patch matching, which is an important task in many computer vision applications.
Here, we also report a radiological OA diagnosis area under the ROC curve of 0.93.
The labels are provided by annotators with different levels of experience in Kendo, to demonstrate how the proposed method adapts to different needs.
To investigate the current status of affective image tagging, we (1) introduce a new eye movement dataset collected with an affordable eye tracker, (2) study the use of deep neural networks for pleasantness recognition, and (3) investigate the gap between deep features and eye movements.
In this paper, we propose an encoder-decoder convolutional neural network (CNN) architecture for estimating camera pose (orientation and location) from a single RGB-image.
Building a complete inertial navigation system using the limited-quality data provided by current smartphones has been regarded as challenging, if not impossible.
This paper presents a convolutional neural network based approach for estimating the relative pose between two cameras.
The results obtained on the datasets used show a mean intersection over union of 0.84, 0.79, and 0.78, respectively.
This paper presents a novel fixation prediction and saliency modeling framework based on inter-image similarities and an ensemble of Extreme Learning Machines (ELMs).
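For context, an Extreme Learning Machine keeps randomly drawn input weights fixed and solves only the output weights in closed form; a minimal NumPy version of a single ELM (independent of the paper's ensemble and feature setup) looks like:

```python
import numpy as np

def train_elm(X, Y, n_hidden=512, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)                           # random hidden-layer features
    beta, *_ = np.linalg.lstsq(H, Y, rcond=None)     # least-squares output weights
    return W, b, beta

def predict_elm(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```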
For this, we design a deep neural network that maps both videos and descriptions to a common semantic space and jointly train it with associated pairs of videos and descriptions.
In this paper, we present a method for real-time multi-person human pose estimation from video by utilizing convolutional neural networks.
In description generation, the performance level is comparable to the current state-of-the-art, although our embeddings were trained for the retrieval tasks.
The parameters of the graph cut problems are learnt in such a manner that they provide complementary sets of regions.
We show that the collected data can be used to study the relation between part detection and attribute prediction by diagnosing the performance of classifiers that pool information from different parts of an object.
This paper introduces FGVC-Aircraft, a new dataset containing 10,000 images of aircraft spanning 100 aircraft models, organised in a three-level hierarchy.