Denoising diffusion probabilistic models are becoming the leading paradigm of generative modeling for many important data modalities.
Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly.
These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention.
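To make the localized-attention idea concrete, below is a minimal sliding-window attention sketch in PyTorch, written for a 1D sequence and using a dense mask for clarity. It is only an illustration of the restricted receptive field: the actual Neighborhood Attention (NATTEN) and Swin implementations operate on 2D feature maps, clamp or shift windows at borders, and never materialize the full attention matrix. The function name and window size here are hypothetical.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int):
    """Each query attends only to keys within +/- `window` positions.

    q, k, v: (batch, seq_len, dim). A dense (seq_len, seq_len) mask is
    built for clarity; real local-attention kernels avoid this cost.
    """
    b, n, d = q.shape
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (b, n, n)
    idx = torch.arange(n)
    blocked = (idx[None, :] - idx[:, None]).abs() > window  # True = outside window
    scores = scores.masked_fill(blocked, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# toy usage: with window=3, each token attends to at most 7 positions
x = torch.randn(2, 16, 32)
out = sliding_window_attention(x, x, x, window=3)
print(out.shape)  # torch.Size([2, 16, 32])
```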
Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control. However, how these human videos can be used for general-purpose reward learning remains an open question.
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.
We introduce Mintaka, a complex, natural, and multilingual dataset designed for experimenting with end-to-end question-answering models.
Intrinsic decomposition is a fundamentally ambiguous and under-constrained inverse problem. We therefore propose a novel distance-aware point sampling and adaptive reflectance iterative clustering optimization method that, combined with traditional intrinsic decomposition constraints, allows IntrinsicNeRF to be trained in an unsupervised manner and yields temporally consistent intrinsic decomposition results.
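For illustration only, the sketch below shows a generic iterative (k-means style) clustering of per-point reflectance values, to convey what "iterative clustering optimization" over reflectance can look like. It is not IntrinsicNeRF's adaptive clustering or its distance-aware sampling; the function, parameters, and cluster count are hypothetical.

```python
import torch

def cluster_reflectance(reflectance, k: int = 8, iters: int = 10):
    """Generic k-means over per-point reflectance (N, 3) values --
    an illustration of iterative clustering, not the paper's method."""
    n = reflectance.shape[0]
    centers = reflectance[torch.randperm(n)[:k]].clone()
    for _ in range(iters):
        # assign each point to its nearest cluster center
        dists = torch.cdist(reflectance, centers)   # (N, k)
        assign = dists.argmin(dim=1)
        # recompute each center as the mean of its assigned points
        for c in range(k):
            pts = reflectance[assign == c]
            if len(pts) > 0:
                centers[c] = pts.mean(dim=0)
    return assign, centers

# toy usage: cluster 1000 random reflectance values into 8 groups
refl = torch.rand(1000, 3)
labels, centers = cluster_reflectance(refl)
print(labels.shape, centers.shape)  # torch.Size([1000]) torch.Size([8, 3])
```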