We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the outputs of the multimodal decoder, which predicts text tokens autoregressively.
Ranked #1 on Visual Reasoning on NLVR2 Test
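A minimal sketch of how such a combined objective can be written, assuming PyTorch and placeholder names (`image_emb`, `text_emb`, `decoder_logits`, `caption_tokens`); the temperature and loss weight below are illustrative, not the paper's values:

```python
# Sketch of a combined contrastive + captioning objective (not the authors' code).
import torch
import torch.nn.functional as F

def contrastive_plus_captioning_loss(image_emb, text_emb, decoder_logits,
                                     caption_tokens, temperature=0.07,
                                     caption_weight=2.0):
    # Contrastive part: normalize unimodal embeddings, build pairwise logits.
    image_emb = F.normalize(image_emb, dim=-1)        # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)          # (B, D)
    logits = image_emb @ text_emb.t() / temperature   # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # Captioning part: autoregressive next-token prediction.
    # decoder_logits: (B, T, V) aligned with input positions; caption_tokens: (B, T).
    captioning = F.cross_entropy(
        decoder_logits[:, :-1].reshape(-1, decoder_logits.size(-1)),
        caption_tokens[:, 1:].reshape(-1))

    return contrastive + caption_weight * captioning
```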
In particular, the quantizer projects speech inputs with a randomly initialized matrix and does a nearest-neighbor lookup in a randomly initialized codebook.
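A minimal sketch of such a random-projection quantizer, with assumed shapes and hyperparameters; both the projection and the codebook are frozen buffers, matching the "randomly initialized, never trained" description:

```python
# Sketch of a random-projection quantizer (assumed shapes, not the paper's code).
import torch
import torch.nn.functional as F

class RandomProjectionQuantizer(torch.nn.Module):
    def __init__(self, input_dim, codebook_dim=16, codebook_size=8192, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        # Projection and codebook are random and frozen: buffers, not parameters.
        self.register_buffer(
            "projection", torch.randn(input_dim, codebook_dim, generator=g))
        self.register_buffer(
            "codebook", F.normalize(
                torch.randn(codebook_size, codebook_dim, generator=g), dim=-1))

    @torch.no_grad()
    def forward(self, features):  # features: (B, T, input_dim)
        projected = F.normalize(features @ self.projection, dim=-1)
        # Nearest neighbor on normalized vectors == argmax of dot products.
        return (projected @ self.codebook.t()).argmax(dim=-1)  # (B, T) token ids
```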
We term this approach Co-training Videos and Images for Action Recognition (CoVeR).
Ranked #4 on Action Classification on Kinetics-400 (using extra training data)
no code implementations • 19 Nov 2021 • Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, Quoc V. Le
Second, while increasing the dataset size and the model size has been the de facto method to improve the performance of deep learning models like BASIC, the effect of a large contrastive batch size on such contrastively trained image-text models is not well understood.
Ranked #2 on Zero-Shot Transfer Image Classification on ImageNet (using extra training data)
Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively.
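A minimal sketch of the autoregressive pretraining objective this describes, assuming a hypothetical decoder-only `transformer` that maps token ids to next-token logits:

```python
# Sketch of next-token prediction over rasterized image tokens (assumed API).
import torch
import torch.nn.functional as F

def autoregressive_image_loss(transformer, image_tokens):
    # image_tokens: (B, H, W) discrete codes from a learned image tokenizer.
    seq = image_tokens.flatten(1)          # raster (row-major) order: (B, H*W)
    logits = transformer(seq[:, :-1])      # predict token t from tokens < t
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           seq[:, 1:].reshape(-1))
```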
no code implementations • 27 Sep 2021 • Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, Yonghui Wu
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio.
With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks.
Ranked #2 on Image Captioning on nocaps-val-out-domain
no code implementations • 21 Nov 2020 • Bo Li, Anmol Gulati, Jiahui Yu, Tara N. Sainath, Chung-Cheng Chiu, Arun Narayanan, Shuo-Yiin Chang, Ruoming Pang, Yanzhang He, James Qin, Wei Han, Qiao Liang, Yu Zhang, Trevor Strohman, Yonghui Wu
To address this, we explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers, which have shown good improvements for ASR.
In addition, we show that, to leading order in $N$, the variance of the neural network's statistical output decays as the normalization implied by the scaling parameter approaches the mean-field normalization.
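A sketch of the scaling being described, under the shallow-network parametrization this line of work typically studies (the specific form below is an assumption, not quoted from the paper):

$$g_N(x) \;=\; \frac{1}{N^{\gamma}} \sum_{i=1}^{N} c^{i}\, \sigma\!\left(w^{i} \cdot x\right), \qquad \gamma \in \left[\tfrac{1}{2}, 1\right],$$

where $\gamma = \tfrac{1}{2}$ recovers the standard $1/\sqrt{N}$ normalization and $\gamma = 1$ the mean-field normalization. Under i.i.d. initialization of the parameters $(c^{i}, w^{i})$, the output variance scales to leading order as $\operatorname{Var}(g_N(x)) = O(N^{1-2\gamma})$, which indeed decays as $\gamma \to 1$.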
The proposed model consists of streaming and non-streaming encoders.
FastEmit also improves streaming ASR accuracy from 4.4%/8.9% to 3.1%/7.5% WER, while reducing 90th-percentile latency from 210 ms to only 30 ms on LibriSpeech.
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses.
The generative adversarial network (GAN) framework has emerged as a powerful tool for various image and video synthesis tasks, allowing the synthesis of visual content in an unconditional or input-conditional manner.
After learning a new object category from image-level annotations (with no object bounding boxes), humans are remarkably good at precisely localizing those objects.
Inspired by the robustness and efficiency of sparse representations in sparse-coding-based image restoration models, we investigate the sparsity of neurons in deep networks.
In automatic speech recognition (ASR), model pruning is a widely adopted technique that reduces model size and latency to deploy neural network models on edge devices with resource constraints.
Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in automatic speech recognition (ASR), outperforming recurrent neural networks (RNNs).
Ranked #9 on Speech Recognition on LibriSpeech test-clean
We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets.
Self-similarity is an image prior widely used in image restoration algorithms: small but similar patterns tend to recur at different locations and scales.
Without extra retraining or post-processing steps, we are able to train a single set of shared weights on ImageNet and use these weights to obtain child models whose sizes range from 200 to 1000 MFLOPs.
Ranked #15 on Neural Architecture Search on ImageNet
In this paper, we show that properly modeling scale-invariance into neural networks can bring significant benefits to image restoration performance.
In this work, we propose BigNAS, an approach that simplifies this workflow and scales up neural architecture search to target a wide range of model sizes simultaneously.
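A minimal sketch of training one set of shared weights so that many child models can later be sliced out, using sandwich-style sampling; `supernet.sample(...)` is a hypothetical API, and the paper's further techniques (e.g., distilling random children from the biggest child) are omitted here:

```python
# Sketch of sandwich-style supernet training (hypothetical supernet API).
import torch.nn.functional as F

def sandwich_step(supernet, x, y, optimizer, num_random=2):
    optimizer.zero_grad()
    # Each step trains the smallest child, the biggest child, and a few
    # randomly sampled children; gradients accumulate in the shared weights.
    configs = ([supernet.sample("min"), supernet.sample("max")] +
               [supernet.sample("random") for _ in range(num_random)])
    for cfg in configs:
        logits = supernet(x, cfg)              # forward the child sub-network
        F.cross_entropy(logits, y).backward()  # accumulate shared-weight grads
    optimizer.step()
```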
To distill diverse knowledge from different trained (teacher) models, we propose an adversarial learning strategy: a block-wise training loss guides and optimizes the predefined student network to recover the knowledge in the teacher models, while a discriminator network is simultaneously trained to distinguish teacher features from student features.
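A minimal sketch of one such adversarial distillation step on a single block's features, with assumed tensor names; real training would sum this block-wise loss over several blocks:

```python
# Sketch of adversarial block-wise distillation (assumed shapes/names).
import torch
import torch.nn.functional as F

def distillation_step(student_feats, teacher_feats, discriminator,
                      opt_student, opt_disc):
    # Discriminator update: teacher features are "real", student are "fake".
    d_real = discriminator(teacher_feats.detach())
    d_fake = discriminator(student_feats.detach())
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # Student update: match teacher features and fool the discriminator.
    d_fake = discriminator(student_feats)
    g_loss = (F.mse_loss(student_feats, teacher_feats.detach()) +
              F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake)))
    opt_student.zero_grad()
    g_loss.backward()
    opt_student.step()
    return d_loss.item(), g_loss.item()
```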
Notably, by setting optimized channel numbers, our AutoSlim-MobileNet-v2 at 305M FLOPs achieves 74.2% top-1 accuracy, 2.4% better than default MobileNet-v2 (301M FLOPs), and even 0.2% better than RL-searched MNasNet (317M FLOPs).
FSNet has the same architecture as the baseline CNN to be compressed, and in the forward process each convolution layer of FSNet extracts from FS the same number of filters as the corresponding layer of the baseline CNN.
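A minimal sketch of a filter-summary convolution, with hypothetical names: filters are overlapping windows of one shared 1D parameter vector, so the layer exposes the same number of filters as the baseline layer while storing far fewer parameters:

```python
# Sketch of a filter-summary conv layer (hypothetical construction).
import torch
import torch.nn.functional as F

class FilterSummaryConv2d(torch.nn.Module):
    def __init__(self, in_ch, out_ch, k=3, stride=1, padding=1, window_stride=4):
        super().__init__()
        filter_size = in_ch * k * k
        # Summary length chosen so that sliding a window of filter_size with
        # step window_stride yields exactly out_ch filters.
        summary_len = filter_size + (out_ch - 1) * window_stride
        self.fs = torch.nn.Parameter(torch.randn(summary_len) * 0.01)
        self.shape = (out_ch, in_ch, k, k)
        self.filter_size, self.window_stride = filter_size, window_stride
        self.stride, self.padding = stride, padding

    def forward(self, x):
        # Overlapping windows over the summary vector -> one filter per window.
        weight = self.fs.unfold(0, self.filter_size, self.window_stride)
        return F.conv2d(x, weight.reshape(self.shape),
                        stride=self.stride, padding=self.padding)
```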
In this paper, we investigate the role of Rademacher complexity in improving generalization of DNNs and propose a novel regularizer rooted in Local Rademacher Complexity (LRC).
We show that by such disentanglement, the contour completion model predicts reasonable contours of objects, and further substantially improves the performance of image inpainting.
Instead of training individual networks with different width configurations, we train a shared network with switchable batch normalization.
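A minimal sketch of a slimmable layer with switchable batch normalization: the convolution weight is shared and sliced per width, while each width keeps private BN statistics and affine parameters:

```python
# Sketch of a shared conv with switchable batch normalization.
import torch
import torch.nn as nn

class SlimmableConvBN(nn.Module):
    def __init__(self, in_ch, out_ch, width_mults=(0.25, 0.5, 0.75, 1.0)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        # One private BatchNorm per supported width.
        self.bns = nn.ModuleList(
            nn.BatchNorm2d(int(out_ch * w)) for w in width_mults)

    def forward(self, x, width_idx):
        out_ch = self.bns[width_idx].num_features
        # Slice the shared weight to the active input/output channel counts.
        w = self.conv.weight[:out_ch, :x.size(1)]
        y = torch.nn.functional.conv2d(x, w, stride=1, padding=1)
        return self.bns[width_idx](y)
```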
While training on independently and identically distributed (i.i.d.) samples has been the de facto paradigm for optimizing image classification networks, humans learn new concepts in an easy-to-hard manner, progressing through selected examples.
Keras-based implementation of WDSR, EDSR and SRGAN for single image super-resolution
Ranked #4 on Multi-Frame Super-Resolution on PROBA-V
We present a generative image inpainting system that completes images with free-form masks and guidance.
Ranked #3 on Image Inpainting on Places2 val
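A gated convolution is the building block commonly associated with free-form inpainting systems like this one; a minimal sketch of that mechanism (an assumption about the design, not the authors' implementation):

```python
# Sketch of a gated convolution: a feature branch modulated by a learned gate.
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, stride=1, padding=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, k, stride, padding)
        self.gate = nn.Conv2d(in_ch, out_ch, k, stride, padding)

    def forward(self, x):
        # The sigmoid gate decides, per pixel and channel, how much of the
        # feature response passes through -- useful under free-form masks.
        return (torch.nn.functional.elu(self.feature(x)) *
                torch.sigmoid(self.gate(x)))
```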
Motivated by these observations, we propose a new deep generative model-based approach which can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions.
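A minimal sketch of explicitly borrowing surrounding features as references, in the spirit of contextual attention; the shapes and temperature are assumptions:

```python
# Sketch of attention from hole features to known background features.
import torch
import torch.nn.functional as F

def contextual_attention(hole_feats, bg_feats, temperature=10.0):
    # hole_feats: (N, C) features inside the missing region;
    # bg_feats:   (M, C) features from the known surroundings.
    sim = (F.normalize(hole_feats, dim=-1) @
           F.normalize(bg_feats, dim=-1).t())         # cosine similarity (N, M)
    attn = torch.softmax(sim * temperature, dim=-1)   # attention weights
    return attn @ bg_feats                            # (N, C) reconstructed
```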
In this paper, we present a simple and parameter-efficient drop-in module for one-stage object detectors like SSD when learning from scratch (i.e., without pre-trained models).
In current object detection systems, deep convolutional neural networks (CNNs) are used to predict bounding boxes of object candidates and have gained performance advantages over traditional region proposal methods.