3 code implementations • arXiv 2023 • Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli
Expanding the language coverage of speech technology has the potential to improve access to information for many more people.
1 code implementation • 5 Jan 2022 • Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe.
Ranked #2 on Audio-Visual Speech Recognition on LRS3-TED (using extra training data)
Audio-Visual Speech Recognition Automatic Speech Recognition +5
2 code implementations • ICLR 2022 • Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed
The lip-reading WER is further reduced to 26. 9% when using all 433 hours of labeled data from LRS3 and combined with self-training.
Ranked #1 on Speech Recognition on LRS3-TED (using extra training data)
Automatic Speech Recognition Automatic Speech Recognition (ASR) +4
1 code implementation • 15 May 2022 • Bowen Shi, Abdelrahman Mohamed, Wei-Ning Hsu
This paper investigates self-supervised pre-training for audio-visual speaker representation learning where a visual stream showing the speaker's mouth area is used alongside speech as inputs.
1 code implementation • 14 Jul 2022 • Wei-Ning Hsu, Bowen Shi
By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par or better than the state-of-the-art modality-specific models.
1 code implementation • 1 Mar 2023 • Mohamed Anwar, Bowen Shi, Vedanuj Goswami, Wei-Ning Hsu, Juan Pino, Changhan Wang
We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation providing 1200 hours of audio-visual speech in 9 languages.
Audio-Visual Speech Recognition Robust Speech Recognition +4
1 code implementation • 5 Sep 2023 • Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Richard James, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz, Luke Zettlemoyer, Armen Aghajanyan
It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs.
Ranked #2 on Text-to-Image Generation on MS COCO
1 code implementation • 8 Nov 2022 • Ankita Pasad, Bowen Shi, Karen Livescu
We further investigate the utility of our analyses for downstream tasks by comparing the property trends with performance on speech recognition and spoken language understanding tasks.
2 code implementations • ICCV 2019 • Bowen Shi, Aurora Martinez Del Rio, Jonathan Keane, Diane Brentari, Greg Shakhnarovich, Karen Livescu
In this paper we focus on recognition of fingerspelling sequences in American Sign Language (ASL) videos collected in the wild, mainly from YouTube and Deaf social media.
1 code implementation • 25 May 2022 • Bowen Shi, Diane Brentari, Greg Shakhnarovich, Karen Livescu
Existing work on sign language translation - that is, translation from sign language videos into sentences in a written language - has focused mainly on (1) data collected in a controlled environment or (2) data in a specific domain, which limits the applicability to real-world settings.
1 code implementation • WS 2020 • Shubham Toshniwal, Haoyue Shi, Bowen Shi, Lingyu Gao, Karen Livescu, Kevin Gimpel
Many natural language processing (NLP) tasks involve reasoning with textual spans, including question answering, entity recognition, and coreference resolution.
1 code implementation • CVPR 2021 • Bowen Shi, Diane Brentari, Greg Shakhnarovich, Karen Livescu
We propose a benchmark and a suite of evaluation metrics, some of which reflect the effect of detection on the downstream fingerspelling recognition task.
1 code implementation • NeurIPS 2023 • Jin Li, Yaoming Wang, Xiaopeng Zhang, Bowen Shi, Dongsheng Jiang, Chenglin Li, Wenrui Dai, Hongkai Xiong, Qi Tian
Specifically, at the intermediate layer of the ViT, we utilize a spatial-aware density-based clustering algorithm to select representative tokens from the token sequence.
1 code implementation • 23 Nov 2021 • Han Li, Bowen Shi, Wenrui Dai, Yabo Chen, Botao Wang, Yu Sun, Min Guo, Chenlin Li, Junni Zou, Hongkai Xiong
Recent 2D-to-3D human pose estimation works tend to utilize the graph structure formed by the topology of the human skeleton.
Ranked #41 on 3D Human Pose Estimation on MPI-INF-3DHP (AUC metric)
1 code implementation • 7 Jan 2023 • Yuetian Chen, Ruohua Li, Bowen Shi, Peiru Liu, Mei Si
Automated visual story generation aims to produce stories with corresponding illustrations that exhibit coherence, progression, and adherence to characters' emotional development.
1 code implementation • CVPR 2023 • Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Jin Li, Yuchen Liu, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian
To mitigate the computational and storage demands, recent research has explored Parameter-Efficient Fine-Tuning (PEFT), which focuses on tuning a minimal number of parameters for efficient adaptation.
1 code implementation • 17 Jan 2020 • Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Bowen Shi, Qi Tian, Hongkai Xiong
However, these methods suffer the difficulty in optimizing network, so that the searched network is often unfriendly to hardware.
1 code implementation • 1 Jul 2020 • Bowen Shi, Shane Settle, Karen Livescu
We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs.
1 code implementation • 8 May 2023 • Junran Wu, Xueyuan Chen, Bowen Shi, Shangzhe Li, Ke Xu
In contrastive learning, the choice of ``view'' controls the information that the representation captures and influences the performance of the model.
1 code implementation • 22 Jun 2023 • Yuetian Chen, Bowen Shi, Mei Si
Artificial intelligence has made significant progress in natural language processing, with models like GPT-3 demonstrating impressive capabilities.
no code implementations • 9 Oct 2017 • Bowen Shi, Karen Livescu
We introduce a model for fingerspelling recognition that addresses these issues.
no code implementations • 26 Oct 2018 • Bowen Shi, Aurora Martinez Del Rio, Jonathan Keane, Jonathan Michaux, Diane Brentari, Greg Shakhnarovich, Karen Livescu
As the first attempt at fingerspelling recognition in the wild, this work is intended to serve as a baseline for future work on sign language recognition in realistic conditions.
no code implementations • 24 Apr 2019 • Ankita Pasad, Bowen Shi, Herman Kamper, Karen Livescu
Recent work has shown that speech paired with images can be used to learn semantically meaningful speech representations even without any textual supervision.
no code implementations • 29 Apr 2019 • Bowen Shi, Ming Sun, Chieh-Chi Kao, Viktor Rozgic, Spyros Matsoukas, Chao Wang
This paper presents our work of training acoustic event detection (AED) models using unlabeled dataset.
no code implementations • NIPS Workshop CDNNRIA 2018 • Bowen Shi, Ming Sun, Chieh-Chi Kao, Viktor Rozgic, Spyros Matsoukas, Chao Wang
In this paper, we present a compression approach based on the combination of low-rank matrix factorization and quantization training, to reduce complexity for neural network based acoustic event detection (AED) models.
no code implementations • 1 Jul 2019 • Bowen Shi, Ming Sun, Chieh-Chi Kao, Viktor Rozgic, Spyros Matsoukas, Chao Wang
Acoustic Event Detection (AED), aiming at detecting categories of events based on audio signals, has found application in many intelligent systems.
no code implementations • 21 Feb 2020 • Bowen Shi, Ming Sun, Krishna C. Puvvada, Chieh-Chi Kao, Spyros Matsoukas, Chao Wang
We study few-shot acoustic event detection (AED) in this paper.
no code implementations • 26 Aug 2020 • Bowen Shi, Isaac H. Kim
We derive a universal correction to the ground-state entanglement entropy, which is equal to the logarithm of the total quantum dimension of a set of superselection sectors localized on the domain wall.
Strongly Correlated Electrons High Energy Physics - Theory Quantum Physics
no code implementations • 8 Jun 2021 • Bowen Shi, Xiaopeng Zhang, Haohang Xu, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian
This is achieved by first pretraining the network via the proposed pixel-to-prototype contrastive loss over multiple datasets regardless of their taxonomy labels, and followed by fine-tuning the pretrained model over specific dataset as usual.
no code implementations • ACL 2022 • Bowen Shi, Diane Brentari, Greg Shakhnarovich, Karen Livescu
This is an important task since significant content in sign language is often conveyed via fingerspelling, and to our knowledge the task has not been studied before.
no code implementations • 21 Dec 2022 • Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi
Moreover, we utilize self-supervised audio-visual speech model to initialize P-AVSR.
Ranked #1 on Speech Recognition on EasyCom
no code implementations • 15 Feb 2023 • Han Li, Bowen Shi, Wenrui Dai, Hongwei Zheng, Botao Wang, Yu Sun, Min Guo, Chenlin Li, Junni Zou, Hongkai Xiong
There has been a recent surge of interest in introducing transformers to 3D human pose estimation (HPE) due to their powerful capabilities in modeling long-term dependencies.
no code implementations • 9 Mar 2023 • Ning Liao, Bowen Shi, Xiaopeng Zhang, Min Cao, Junchi Yan, Qi Tian
To explore prompt learning on the generative pre-trained visual model, as well as keeping the task consistency, we propose Visual Prompt learning as masked visual Token Modeling (VPTM) to transform the downstream visual classification into the pre-trained masked visual token prediction.
no code implementations • CVPR 2023 • Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi
Moreover, we utilize self-supervised audio-visual speech model to initialize P-AVSR.
no code implementations • 28 Jun 2023 • Bowen Shi, Xiaopeng Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian
In order to better obtain both discrimination and diversity, we propose a simple but effective Hybrid Distillation strategy, which utilizes both the supervised/CL teacher and the MIM teacher to jointly guide the student model.
no code implementations • 18 Jul 2023 • Hongwei Zheng, Han Li, Bowen Shi, Wenrui Dai, Botao Wan, Yu Sun, Min Guo, Hongkai Xiong
Recent 2D-to-3D human pose estimation (HPE) utilizes temporal consistency across sequences to alleviate the depth ambiguity problem but ignore the action related prior knowledge hidden in the pose sequence.
no code implementations • 10 Aug 2023 • Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux
Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization).
no code implementations • 23 Aug 2023 • Bowen Shi
To address the problem of searching for fingerspelled keywords in raw sign language videos, we propose a novel method that jointly localizes and matches fingerspelling segments to text.
no code implementations • 25 Oct 2023 • Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu
Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data.
no code implementations • 25 Dec 2023 • Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu
Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data.
Ranked #1 on Audio Generation on AudioCaps
no code implementations • 12 Jan 2024 • Bowen Shi, Peisen Zhao, Zichen Wang, Yuhang Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian, Xiaopeng Zhang
Vision-language foundation models, represented by Contrastive language-image pre-training (CLIP), have gained increasing attention for jointly understanding both vision and textual tasks.
no code implementations • 14 Feb 2024 • Phillip Rust, Bowen Shi, Skyler Wang, Necati Cihan Camgöz, Jean Maillard
A major impediment to the advancement of sign language translation (SLT) is data scarcity.
no code implementations • 21 Mar 2024 • Hyojung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang
It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes.