Existing work on sign language translation, i.e., translating sign language videos into sentences in a written language, has focused mainly on (1) data collected in a controlled environment or (2) data in a specific domain, which limits its applicability to real-world settings.
This is an important task since significant content in sign language is often conveyed via fingerspelling, and to our knowledge the task has not been studied before.
We propose Neural Neighbor Style Transfer (NNST), a pipeline that offers state-of-the-art quality, generalization, and competitive efficiency for artistic style transfer.
We present an oracle-efficient algorithm for boosting the adversarial robustness of barely robust learners.
Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams.
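As a minimal illustration of why calibration matters (a pinhole-model sketch, not any specific paper's method; the intrinsic values below are made up), recovering the intrinsic matrix `K` is what lets an algorithm map between 3D scene geometry and pixel coordinates:

```python
import numpy as np

# Pinhole-camera sketch: project a 3D point in camera coordinates to pixel
# coordinates using an intrinsic matrix K. Values are illustrative only.

def project(K, point_3d):
    """Project a 3D point (camera frame) to 2D pixel coordinates."""
    x = K @ point_3d        # homogeneous image coordinates
    return x[:2] / x[2]     # perspective divide

# Example intrinsics: focal length 500 px, principal point at (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

p = np.array([0.2, -0.1, 2.0])   # a point 2 m in front of the camera
u, v = project(K, p)
# u = 500 * 0.2 / 2 + 320 = 370.0, v = 500 * (-0.1) / 2 + 240 = 215.0
```

Without an accurate `K`, any geometric inference from the image (depth, scene structure, ego-motion) inherits the projection error.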
We propose a benchmark and a suite of evaluation metrics, some of which reflect the effect of detection on the downstream fingerspelling recognition task.
In this work, we extend monocular self-supervised depth and ego-motion estimation to large-baseline multi-camera rigs.
We study image segmentation from an information-theoretic perspective, proposing a novel adversarial method that performs unsupervised segmentation by partitioning images into maximally independent sets.
Ranked #1 on Unsupervised Image Segmentation on Flowers
Self-supervised learning has emerged as a powerful tool for depth and ego-motion estimation, leading to state-of-the-art results on benchmark datasets.
The core of our approach, Pixel Consensus Voting, is a framework for instance segmentation based on the Generalized Hough transform.
Ranked #34 on Panoptic Segmentation on COCO test-dev
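The Generalized Hough idea behind this can be sketched in a few lines (a toy illustration, not the paper's pipeline; the pixel positions and offsets below are invented): each pixel votes for its predicted instance center, and peaks in the accumulated vote map yield instances.

```python
import numpy as np

# Toy Hough-style voting: each pixel casts a vote for its predicted instance
# center; the consensus peak in the vote map marks a detected instance.

H, W = 8, 8
votes = np.zeros((H, W))

# Suppose three pixels of one object all predict the same center (3, 4);
# the per-pixel (dy, dx) offsets below are made up for illustration.
pixels = [(2, 3), (3, 5), (4, 4)]
offsets = [(1, 1), (0, -1), (-1, 0)]

for (y, x), (dy, dx) in zip(pixels, offsets):
    votes[y + dy, x + dx] += 1        # accumulate a vote at the predicted center

center = np.unravel_index(np.argmax(votes), votes.shape)
# center == (3, 4): all three votes agree, so this cell is the peak
```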
We consider the problem of space-time super-resolution (ST-SR): increasing spatial resolution of video frames and simultaneously interpolating frames to increase the frame rate.
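To make the input/output shapes of ST-SR concrete, here is a deliberately naive baseline (nearest-neighbor upsampling plus frame repetition, purely illustrative; real ST-SR methods learn the spatial and temporal upsampling jointly):

```python
import numpy as np

# Naive space-time super-resolution baseline: frame repetition in time and
# nearest-neighbor replication in space. This only illustrates the task's
# input/output shapes, not a learned method.

def naive_st_sr(video, time_scale=2, space_scale=4):
    """video: array of shape (T, H, W) -> (T*time_scale, H*space_scale, W*space_scale)."""
    v = np.repeat(video, time_scale, axis=0)    # temporal "interpolation"
    v = np.repeat(v, space_scale, axis=1)       # vertical upsampling
    v = np.repeat(v, space_scale, axis=2)       # horizontal upsampling
    return v

clip = np.random.rand(8, 32, 32)     # 8 frames of 32x32 video
out = naive_st_sr(clip)
# out.shape == (16, 128, 128): doubled frame rate, 4x spatial resolution
```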
This paper presents a framework for the analysis of changes in visual streams: ordered sequences of images, possibly separated by significant time gaps.
In this paper we focus on recognition of fingerspelling sequences in American Sign Language (ASL) videos collected in the wild, mainly from YouTube and Deaf social media.
Previous feed-forward architectures of recently proposed deep super-resolution networks learn the features of low-resolution inputs and the non-linear mapping from those to a high-resolution output.
Ranked #1 on Image Super-Resolution on BSDS100 - 8x upscaling
We propose a novel architecture for the problem of video super-resolution.
As the first attempt at fingerspelling recognition in the wild, this work is intended to serve as a baseline for future work on sign language recognition in realistic conditions.
We consider how image super-resolution (SR) can contribute to an object detection task in low-resolution images.
The feed-forward architectures of recently proposed deep super-resolution networks learn representations of low-resolution inputs and the non-linear mapping from those to a high-resolution output.
Ranked #13 on Video Super-Resolution on Vid4 - 4x upscaling
As an agent moves through the world, the apparent motion of scene elements is (usually) inversely proportional to their depth.
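For a purely lateral camera translation under a pinhole model, this relation takes a simple closed form: a point at depth Z displaces by u = f * t_x / Z pixels between frames. A minimal sketch (illustrative numbers, not from any dataset):

```python
# Flow-depth relation for a laterally translating pinhole camera:
# apparent motion falls off as 1/Z, so near points move more than far ones.

def lateral_flow(f, t_x, Z):
    """Horizontal pixel displacement induced by camera translation t_x at depth Z."""
    return f * t_x / Z

f = 500.0        # focal length in pixels
t_x = 0.1        # camera moved 0.1 m sideways between frames
near, far = 2.0, 20.0

lateral_flow(f, t_x, near)   # 25.0 px for the near point
lateral_flow(f, t_x, far)    # 2.5 px for the point ten times farther away
```

This inverse relationship is what lets self-supervised methods recover depth (up to scale) from observed motion.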
We formulate the problem of metric learning for k nearest neighbor classification as a large margin structured prediction problem, with a latent variable representing the choice of neighbors and the task loss directly corresponding to classification error.
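The large-margin idea can be sketched with a toy hinge loss (in the spirit of the formulation above, not its exact structured-prediction objective; the metric, points, and unit margin below are illustrative): under the learned metric, a query's same-class target neighbor should be closer than any differently-labeled "impostor" by a margin.

```python
import numpy as np

# Toy large-margin hinge for metric learning: penalize metrics under which an
# impostor is not at least one unit (squared distance) farther than the target.

def mahalanobis_sq(M, a, b):
    """Squared Mahalanobis distance between a and b under metric M."""
    d = a - b
    return d @ M @ d

def margin_loss(M, query, target, impostor):
    """Hinge loss: zero only when the impostor clears the unit margin."""
    return max(0.0, 1.0 + mahalanobis_sq(M, query, target)
                        - mahalanobis_sq(M, query, impostor))

M = np.eye(2)                  # start from the plain Euclidean metric
q = np.array([0.0, 0.0])
t = np.array([1.0, 0.0])       # same-class neighbor, squared distance 1.0
imp = np.array([0.0, 1.2])     # other-class point, squared distance 1.44

# loss = max(0, 1 + 1.0 - 1.44) = 0.56: the impostor violates the margin,
# so learning would adjust M to push it farther from the query.
```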