Monocular 3D object detection is a challenging task due to unreliable depth, resulting in a distinct performance gap between monocular and LiDAR-based approaches.
To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
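As a loose illustration of this pseudo-label step (exact string matching between detector labels and caption nouns is an assumption here; the actual pipeline may use a softer alignment), consider:

```python
# Illustrative sketch of pseudo-label creation: keep detected regions whose
# labels match noun concepts parsed from the caption.
def create_pseudo_labels(detections, caption_concepts):
    """detections: list of (box, label) pairs from an off-the-shelf detector;
    caption_concepts: set of lowercase nouns parsed from the caption."""
    return [(box, label) for box, label in detections
            if label.lower() in caption_concepts]

dets = [((10, 20, 50, 60), "dog"), ((0, 0, 30, 30), "car")]
concepts = {"dog", "frisbee"}
print(create_pseudo_labels(dets, concepts))  # keeps only the "dog" region
```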
To help human experts better uncover the biases of AI algorithms, we study a new problem in this work: for a classifier that predicts a target attribute of the input image, discover its unknown biased attribute.
A naïve method is to decompose it into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR).
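A minimal sketch of this naïve two-stage baseline, assuming hypothetical `vfi_model` and `vsr_model` callables (placeholders, not any specific published model):

```python
def naive_stvsr(lr_frames, vfi_model, vsr_model):
    """Two-stage baseline: first VFI to fill in missing LR frames,
    then VSR to upscale every frame independently.
    vfi_model and vsr_model are hypothetical pretrained callables."""
    # Stage 1: interpolate a middle frame between each pair of LR frames.
    dense = []
    for prev, nxt in zip(lr_frames[:-1], lr_frames[1:]):
        dense.append(prev)
        dense.append(vfi_model(prev, nxt))  # synthesized intermediate frame
    dense.append(lr_frames[-1])
    # Stage 2: super-resolve the densified LR sequence frame by frame.
    return [vsr_model(f) for f in dense]
```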
This paper addresses previous limitations by learning a deep lighting model that, in combination with a high-quality 3D face tracking algorithm, provides a method for subtle and robust facial motion transfer from a regular video to a 3D photo-realistic avatar.
To solve this new task, we first present a new language-driven image editing dataset that supports both local and global editing with editing operation and mask annotations.
The marriage of recurrent neural networks and neural ordinary differential equations (ODE-RNN) is effective in modeling irregularly-observed sequences.
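A minimal PyTorch sketch of the ODE-RNN idea (layer sizes and the fixed-step Euler solver are illustrative simplifications): evolve the hidden state with a learned ODE across each irregular time gap, then apply a standard RNN update at the observation.

```python
import torch
import torch.nn as nn

class ODERNNCell(nn.Module):
    """Sketch of ODE-RNN: a learned ODE governs the hidden state between
    observations; a GRU cell incorporates each new observation."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.ode_func = nn.Sequential(   # dh/dt = f(h)
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.rnn_cell = nn.GRUCell(input_dim, hidden_dim)

    def forward(self, xs, ts, n_euler_steps=10):
        # xs: (T, B, input_dim) observations; ts: (T,) increasing timestamps.
        h = xs.new_zeros(xs.size(1), self.rnn_cell.hidden_size)
        t_prev = ts[0]
        hs = []
        for x, t in zip(xs, ts):
            # Evolve h through the time gap with fixed-step Euler integration.
            dt = (t - t_prev) / n_euler_steps
            for _ in range(n_euler_steps):
                h = h + dt * self.ode_func(h)
            h = self.rnn_cell(x, h)  # standard RNN update at the observation
            t_prev = t
            hs.append(h)
        return torch.stack(hs)

model = ODERNNCell(input_dim=8, hidden_dim=32)
xs = torch.randn(5, 2, 8)                      # 5 observations, batch of 2
ts = torch.tensor([0.0, 0.3, 1.1, 1.2, 2.5])   # irregular timestamps
print(model(xs, ts).shape)                     # torch.Size([5, 2, 32])
```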
This technical report summarizes the submissions to the Actor-Action video classification challenge, held as a final project in the CSC 249/449 Machine Vision course (Spring 2020) at the University of Rochester.
In this paper, we introduce a new problem, named audio-visual video parsing, which aims to parse a video into temporal event segments and label them as either audible, visible, or both.
When people deliver a speech, they naturally move their heads, and this rhythmic head motion conveys prosodic information.
The selection of coarse-grained (CG) mapping operators is a critical step for CG molecular dynamics (MD) simulation.
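To make the notion of a mapping operator concrete, here is a small NumPy sketch (illustrative only, not any specific selection method): a binary assignment matrix maps fine-grained atom coordinates to mass-weighted CG bead positions.

```python
import numpy as np

def apply_cg_mapping(coords, masses, assignment):
    """Apply a CG mapping operator.
    coords: (n_atoms, 3) atom positions; masses: (n_atoms,);
    assignment: (n_beads, n_atoms) binary matrix, one bead per atom."""
    weighted = assignment * masses                        # (n_beads, n_atoms)
    weights = weighted / weighted.sum(axis=1, keepdims=True)
    return weights @ coords                               # (n_beads, 3) beads

# Example: map a 4-atom fragment onto 2 beads (values are illustrative).
coords = np.random.rand(4, 3)
masses = np.array([12.0, 1.0, 12.0, 16.0])
assignment = np.array([[1, 1, 0, 0],
                       [0, 0, 1, 1]], dtype=float)
print(apply_cg_mapping(coords, masses, assignment).shape)  # (2, 3)
```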
We present a simple yet highly generalizable method for explaining interacting parts within a neural network's reasoning process.
In this work, we present a carefully-designed benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies.
Instead of blindly trusting quality-inconsistent PAs, WS^2 employs a learning-based strategy to select effective PAs and a novel region integrity criterion as a stopping condition for weakly-supervised training.
The perceptual-based grouping process produces a hierarchical and compositional image representation that helps both human and machine vision systems recognize heterogeneous visual concepts.
Rather than synthesizing missing LR video frames as VFI networks do, we first temporally interpolate the features of missing LR frames, capturing local temporal contexts with the proposed feature temporal interpolation network.
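A hedged sketch of this feature-level interpolation idea (layer shapes are illustrative, not the paper's exact design): blend the features of the two neighboring LR frames with learned, content-dependent weights instead of synthesizing pixels.

```python
import torch
import torch.nn as nn

class FeatureTemporalInterpolation(nn.Module):
    """Sketch: interpolate the feature map of a missing frame from its
    two temporal neighbors via a learned per-pixel blending gate."""

    def __init__(self, channels):
        super().__init__()
        self.conv_prev = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_next = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, feat_prev, feat_next):
        # Capture local temporal context from each neighbor...
        f1 = self.conv_prev(feat_prev)
        f2 = self.conv_next(feat_next)
        # ...then mix with a learned, content-dependent blending weight.
        w = self.gate(torch.cat([feat_prev, feat_next], dim=1))
        return w * f1 + (1 - w) * f2
```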
Deep neural networks can form high-level hierarchical representations of input data.
To overcome those limitations, we propose a novel self-supervised model to synthesize garment images with disentangled attributes (e.g., collar and sleeves) without paired data.
In this paper, we propose a confidence segmentation (ConfSeg) module that builds a confidence score for each pixel in a CAM without introducing additional hyper-parameters.
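One hyper-parameter-free way to turn a CAM into per-pixel confidences (an illustrative stand-in, not necessarily the ConfSeg formulation) is to min-max normalize the map and score each pixel by its distance from the ambiguous midpoint:

```python
import torch

def cam_confidence(cam):
    """Illustrative sketch: derive a per-pixel confidence from a raw class
    activation map (H, W) without extra hyper-parameters."""
    norm = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    # Pixels near 0 or 1 are confidently background/foreground;
    # pixels near 0.5 are ambiguous.
    confidence = 2.0 * (norm - 0.5).abs()
    return norm, confidence
```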
Pose guided synthesis aims to generate a new image in an arbitrary target pose while preserving the appearance details from the source image.
We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions.
Video action recognition, a critical problem in video understanding, has been gaining increasing attention.
Video super-resolution (VSR) aims to restore a photo-realistic high-resolution (HR) video frame from both its corresponding low-resolution (LR) frame (reference frame) and multiple neighboring frames (supporting frames).
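A minimal sketch of this setup (alignment and network depth are simplified placeholders, not a specific published model): fuse the reference frame with its supporting frames, then upscale with pixel shuffle.

```python
import torch
import torch.nn as nn

class SimpleVSR(nn.Module):
    """Sketch of multi-frame VSR: stack the reference LR frame with its
    supporting frames, fuse, and reconstruct the HR reference frame."""

    def __init__(self, n_frames=5, channels=64, scale=4):
        super().__init__()
        self.fuse = nn.Conv2d(3 * n_frames, channels, 3, padding=1)
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1))
        self.upscale = nn.PixelShuffle(scale)

    def forward(self, frames):
        # frames: (B, n_frames, 3, H, W); the center frame is the reference.
        b, t, c, h, w = frames.shape
        x = self.fuse(frames.view(b, t * c, h, w))  # fuse reference + supports
        return self.upscale(self.body(torch.relu(x)))

vsr = SimpleVSR()
print(vsr(torch.rand(1, 5, 3, 32, 32)).shape)  # torch.Size([1, 3, 128, 128])
```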
To achieve this, we propose a multimodal convolutional neural network-based audio-visual video captioning framework and introduce a modality-aware module for exploring modality selection during sentence generation.
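A hedged sketch of what such a modality-aware module could look like (a simplification for illustration; the actual module may differ): a learned gate weighs audio against visual features conditioned on the current decoder state.

```python
import torch
import torch.nn as nn

class ModalityAwareFusion(nn.Module):
    """Sketch: at each decoding step, a learned gate decides how much the
    audio vs. visual stream contributes to generating the next word."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(3 * dim, 1), nn.Sigmoid())

    def forward(self, audio_feat, visual_feat, decoder_state):
        # All inputs: (B, dim). The gate is conditioned on both modalities
        # and the decoder state.
        g = self.gate(torch.cat([audio_feat, visual_feat, decoder_state], dim=-1))
        return g * audio_feat + (1 - g) * visual_feat
```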
Furthermore, we study multiple modalities including description and transcripts for the purpose of boosting video understanding.
To overcome this limitation, we propose a GAN-based EM learning framework that can maximize the likelihood of images and estimate the latent variables with only the constraint of L-Lipschitz continuity.
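One standard way to enforce an L-Lipschitz constraint (an assumption here; the paper may use a different mechanism) is spectral normalization, which bounds each linear layer's Lipschitz constant by 1:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Every layer below is 1-Lipschitz (spectrally normalized linear maps and
# LeakyReLU), so the composed network is 1-Lipschitz as well.
critic = nn.Sequential(
    spectral_norm(nn.Linear(128, 256)), nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 256)), nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 1)),
)
```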
Deep neural networks trained on demonstrations of human actions give robots the ability to drive autonomously on the road.
In this paper, we consider the following task: given an arbitrary speech audio clip and one lip image of an arbitrary target identity, generate synthesized lip movements of the target identity saying the speech.
In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos.
The major difficulty of our segmentation model comes with the fact that the location, structure, and shape of gliomas vary significantly among different patients.
However, current methods for detailed understanding of actor and action have significant limitations: they require large amounts of finely labeled data, and they fail to capture any internal relationship among actors and actions.
Action segmentation as a milestone towards building automatic systems to understand untrimmed videos has received considerable attention in the recent years.
Despite the rapid progress, existing works on action understanding focus strictly on one type of action agent, which we call the actor (a human adult), ignoring the diversity of actions performed by other actors.
Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments.
To answer this question, we introduce the problem of procedure segmentation: segmenting a video procedure into category-independent procedure segments.
Attention mechanisms have attracted considerable interest in image captioning due to their strong performance.
Supervoxel segmentation has strong potential to be incorporated into early video analysis as superpixel segmentation has in image analysis.
Actor-action semantic segmentation made an important step toward advanced video understanding problems: what action is happening; who is performing the action; and where is the action in space-time.
There is no work we know of that simultaneously infers actors and actions in videos, let alone a dataset to experiment with.
In this paper, we conduct a systematic study of how well the actor and action semantics are retained in video supervoxel segmentation.
The problem of describing images through natural language has gained importance in the computer vision community.