We present a method to control the emotional prosody of Text to Speech (TTS) systems by using phoneme-level intermediate features (pitch, energy, and duration) as levers.
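The idea of using phoneme-level intermediate features as levers can be sketched as simple per-phoneme scaling applied before the decoder. The function name, preset values, and scale factors below are illustrative assumptions, not the paper's actual interface:

```python
import numpy as np

# Hypothetical sketch of prosody "levers": scale per-phoneme pitch,
# energy, and duration predictions before they reach the decoder.
# All names and scale values here are illustrative assumptions.
def apply_emotion_levers(pitch, energy, duration,
                         pitch_scale=1.0, energy_scale=1.0,
                         duration_scale=1.0):
    """Scale phoneme-level intermediate features to shift prosody."""
    pitch = pitch * pitch_scale
    energy = energy * energy_scale
    # Durations are integer frame counts; keep at least one frame.
    duration = np.maximum(1, np.round(duration * duration_scale)).astype(int)
    return pitch, energy, duration

# e.g. a hypothetical "excited" preset: higher pitch and energy, faster speech
pitch, energy, dur = apply_emotion_levers(
    np.array([120.0, 180.0, 150.0]),   # per-phoneme F0 (Hz)
    np.array([0.5, 0.9, 0.7]),         # per-phoneme energy
    np.array([5, 8, 6]),               # per-phoneme frame counts
    pitch_scale=1.2, energy_scale=1.1, duration_scale=0.85)
```

A preset like this changes the emotional color of the utterance without retraining, since only the intermediate features are modified.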
Real-time approaches offer key advantages: they eliminate time-consuming post-production processes and deliver high-quality videos in today's fast-paced digital landscape.
Significant progress has been made in speaker dependent Lip-to-Speech synthesis, which aims to generate speech from silent videos of talking faces.
We present MParrotTTS, a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model that can produce high-quality speech.
We present ParrotTTS, a modularized text-to-speech synthesis model leveraging disentangled self-supervised speech representations.
We investigate the Vision-and-Language Navigation (VLN) problem in the context of autonomous driving in outdoor settings.
We introduce a new dataset, Talk2Car-RegSeg, which extends the existing Talk2Car dataset with segmentation masks for the regions described by the linguistic commands.
We find that the presence of multiple domains incentivizes domain-agnostic learning and is the primary reason for generalization in Traditional DG.
Multi-view Detection (MVD) is highly effective for occlusion reasoning in a crowded environment.
Ranked #1 on Multiview Detection on GMVD
Most of these systems suffer from noise in the range data and from a resolution mismatch between the range sensor and the color cameras, since current range sensors offer much lower resolution than color cameras.
We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the natural language description.
Ranked #3 on Referring Expression Segmentation on ReferIt
There has been increasing interest in building deep hierarchy-aware classifiers that aim to quantify and reduce the severity of mistakes, and not just reduce the number of errors.
We also explore a variation of ViNet architecture by augmenting audio features into the decoder.
We present GAZED: eye GAZe-guided EDiting for videos captured by a solitary, static, wide-angle, high-resolution camera.
In this paper, we present a simple baseline for visual grounding in autonomous driving that outperforms state-of-the-art methods while retaining minimal design choices.
Ranked #6 on Referring Expression Comprehension on Talk2Car
In this paper, we investigate a constrained formulation of neural networks where the output is a convex function of the input.
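A minimal sketch of such a constraint, in the spirit of input-convex networks: keeping the weights on the hidden-to-output path non-negative and using convex, non-decreasing activations (e.g. ReLU) makes the scalar output a convex function of the input. The architecture below is a generic illustration, not the paper's exact model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    # ReLU is convex and non-decreasing, which the construction relies on.
    return np.maximum(v, 0.0)

# Hypothetical one-hidden-layer sketch:
#   f(x) = wz @ relu(W0 x + b0) + wx @ x + b
# wz >= 0 ensures a non-negative combination of convex functions,
# plus an affine term, so f is convex in x.
W0 = rng.normal(size=(8, 3))        # first layer: any sign allowed
b0 = rng.normal(size=8)
wz = np.abs(rng.normal(size=8))     # constrained non-negative weights
wx = rng.normal(size=3)             # direct affine "passthrough" term
b = 0.1

def f(x):
    return wz @ relu(W0 @ x + b0) + wx @ x + b

# Numerical convexity check along random segments:
for _ in range(100):
    x1, x2 = rng.normal(size=3), rng.normal(size=3)
    assert f((x1 + x2) / 2) <= (f(x1) + f(x2)) / 2 + 1e-9
```

In practice the non-negativity constraint is typically enforced during training, e.g. by projecting or reparameterizing the constrained weights.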
Multi-object tracking has seen a lot of progress recently, albeit with substantial annotation costs for developing better and larger labeled datasets.
In this paper, we present a method to reliably detect such obstacles through a multi-modal framework of sparse LiDAR (VLP-16) and monocular vision.
As a result, we propose two novel end-to-end architectures, SimpleNet and MDNSal, which are neater, more minimal, and more interpretable, and achieve state-of-the-art performance on public saliency benchmarks.
Autonomous camera systems are often subjected to an optimization/filtering operation to smooth and stabilize the rough trajectory estimates.
Recent works have proposed several long-term tracking benchmarks, highlighting the importance of moving toward long-duration tracking to bridge the gap with application requirements.
Monocular head pose estimation requires learning a model that computes the intrinsic Euler angles of pose (yaw, pitch, roll) from an input image of a human face.
Ranked #2 on Head Pose Estimation on AFLW
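The Euler-angle parameterization of pose can be made concrete with the standard conversion between a rotation matrix and intrinsic (yaw, pitch, roll) angles. This is a generic geometric identity, not the paper's learned model; the ZYX convention below is an assumption:

```python
import numpy as np

# Generic conversion sketch for R = Rz(yaw) @ Ry(pitch) @ Rx(roll)
# (intrinsic ZYX convention, assumed for illustration).
def rot(yaw, pitch, roll):
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def euler_from_rot(R):
    # Valid away from gimbal lock (|pitch| = 90 degrees).
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arcsin(-R[2, 0])
    roll = np.arctan2(R[2, 1], R[2, 2])
    return yaw, pitch, roll

angles = (0.3, -0.2, 0.5)  # radians
recovered = euler_from_rot(rot(*angles))
```

A pose estimator regresses exactly these three numbers; the round trip above shows they uniquely determine the head orientation away from the gimbal-lock configuration.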
Localizing natural language phrases in images is a challenging problem that requires joint understanding of both the textual and visual modalities.
We present a novel network architecture called MergeNet for discovering small obstacles in on-road scenes in the context of autonomous driving.
The proposed method is fully automatic, in contrast to the current state of the art, which requires manual initialization of point correspondences between the image and the static model.
The prose storyboard language is a formal language for describing movies shot by shot, where each shot is described with a unique sentence.