Search Results for author: Bowen Shi

Found 43 papers, 20 papers with code

Robust Self-Supervised Audio-Visual Speech Recognition

1 code implementation 5 Jan 2022 Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed

Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe.

Ranked #2 on Audio-Visual Speech Recognition on LRS3-TED (using extra training data)

Audio-Visual Speech Recognition Automatic Speech Recognition +5

Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

1 code implementation 15 May 2022 Bowen Shi, Abdelrahman Mohamed, Wei-Ning Hsu

This paper investigates self-supervised pre-training for audio-visual speaker representation learning where a visual stream showing the speaker's mouth area is used alongside speech as inputs.

Representation Learning Speaker Verification

u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality

1 code implementation 14 Jul 2022 Wei-Ning Hsu, Bowen Shi

By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than the state-of-the-art modality-specific models.

Speaker Verification speech-recognition +1

MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation

1 code implementation 1 Mar 2023 Mohamed Anwar, Bowen Shi, Vedanuj Goswami, Wei-Ning Hsu, Juan Pino, Changhan Wang

We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation providing 1200 hours of audio-visual speech in 9 languages.

Audio-Visual Speech Recognition Robust Speech Recognition +4

Comparative layer-wise analysis of self-supervised speech models

1 code implementation 8 Nov 2022 Ankita Pasad, Bowen Shi, Karen Livescu

We further investigate the utility of our analyses for downstream tasks by comparing the property trends with performance on speech recognition and spoken language understanding tasks.

speech-recognition Speech Recognition +1

Fingerspelling recognition in the wild with iterative visual attention

2 code implementations ICCV 2019 Bowen Shi, Aurora Martinez Del Rio, Jonathan Keane, Diane Brentari, Greg Shakhnarovich, Karen Livescu

In this paper we focus on recognition of fingerspelling sequences in American Sign Language (ASL) videos collected in the wild, mainly from YouTube and Deaf social media.

Hand Detection Segmentation +1

Open-Domain Sign Language Translation Learned from Online Video

1 code implementation 25 May 2022 Bowen Shi, Diane Brentari, Greg Shakhnarovich, Karen Livescu

Existing work on sign language translation - that is, translation from sign language videos into sentences in a written language - has focused mainly on (1) data collected in a controlled environment or (2) data in a specific domain, which limits the applicability to real-world settings.

Sign Language Translation Translation

A Cross-Task Analysis of Text Span Representations

1 code implementation WS 2020 Shubham Toshniwal, Haoyue Shi, Bowen Shi, Lingyu Gao, Karen Livescu, Kevin Gimpel

Many natural language processing (NLP) tasks involve reasoning with textual spans, including question answering, entity recognition, and coreference resolution.

coreference-resolution Question Answering

Fingerspelling Detection in American Sign Language

1 code implementation CVPR 2021 Bowen Shi, Diane Brentari, Greg Shakhnarovich, Karen Livescu

We propose a benchmark and a suite of evaluation metrics, some of which reflect the effect of detection on the downstream fingerspelling recognition task.

Pose Estimation

AiluRus: A Scalable ViT Framework for Dense Prediction

1 code implementation NeurIPS 2023 Jin Li, Yaoming Wang, Xiaopeng Zhang, Bowen Shi, Dongsheng Jiang, Chenglin Li, Wenrui Dai, Hongkai Xiong, Qi Tian

Specifically, at the intermediate layer of the ViT, we utilize a spatial-aware density-based clustering algorithm to select representative tokens from the token sequence.

object-detection Object Detection +1

Hierarchical Graph Networks for 3D Human Pose Estimation

1 code implementation 23 Nov 2021 Han Li, Bowen Shi, Wenrui Dai, Yabo Chen, Botao Wang, Yu Sun, Min Guo, Chenlin Li, Junni Zou, Hongkai Xiong

Recent 2D-to-3D human pose estimation works tend to utilize the graph structure formed by the topology of the human skeleton.

3D Human Pose Estimation

Visual Story Generation Based on Emotion and Keywords

1 code implementation 7 Jan 2023 Yuetian Chen, Ruohua Li, Bowen Shi, Peiru Liu, Mei Si

Automated visual story generation aims to produce stories with corresponding illustrations that exhibit coherence, progression, and adherence to characters' emotional development.

Image Generation Object Recognition +2

Adapting Shortcut With Normalizing Flow: An Efficient Tuning Framework for Visual Recognition

1 code implementation CVPR 2023 Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Jin Li, Yuchen Liu, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian

To mitigate the computational and storage demands, recent research has explored Parameter-Efficient Fine-Tuning (PEFT), which focuses on tuning a minimal number of parameters for efficient adaptation.

Latency-Aware Differentiable Neural Architecture Search

1 code implementation 17 Jan 2020 Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Bowen Shi, Qi Tian, Hongkai Xiong

However, these methods suffer from difficulty in optimizing the network, so the searched network is often unfriendly to hardware.

Neural Architecture Search

Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings

1 code implementation 1 Jul 2020 Bowen Shi, Shane Settle, Karen Livescu

We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs.

speech-recognition Speech Recognition +1

SEGA: Structural Entropy Guided Anchor View for Graph Contrastive Learning

1 code implementation 8 May 2023 Junran Wu, Xueyuan Chen, Bowen Shi, Shangzhe Li, Ke Xu

In contrastive learning, the choice of "view" controls the information that the representation captures and influences the performance of the model.

Contrastive Learning Graph Classification +1

Prompt to GPT-3: Step-by-Step Thinking Instructions for Humor Generation

1 code implementation 22 Jun 2023 Yuetian Chen, Bowen Shi, Mei Si

Artificial intelligence has made significant progress in natural language processing, with models like GPT-3 demonstrating impressive capabilities.

Multitask training with unlabeled data for end-to-end sign language fingerspelling recognition

no code implementations 9 Oct 2017 Bowen Shi, Karen Livescu

We introduce a model for fingerspelling recognition that addresses these issues.

American Sign Language fingerspelling recognition in the wild

no code implementations 26 Oct 2018 Bowen Shi, Aurora Martinez Del Rio, Jonathan Keane, Jonathan Michaux, Diane Brentari, Greg Shakhnarovich, Karen Livescu

As the first attempt at fingerspelling recognition in the wild, this work is intended to serve as a baseline for future work on sign language recognition in realistic conditions.

Sign Language Recognition

On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval

no code implementations 24 Apr 2019 Ankita Pasad, Bowen Shi, Herman Kamper, Karen Livescu

Recent work has shown that speech paired with images can be used to learn semantically meaningful speech representations even without any textual supervision.

Retrieval Visual Grounding

Compression of Acoustic Event Detection Models with Low-rank Matrix Factorization and Quantization Training

no code implementations NIPS Workshop CDNNRIA 2018 Bowen Shi, Ming Sun, Chieh-Chi Kao, Viktor Rozgic, Spyros Matsoukas, Chao Wang

In this paper, we present a compression approach based on the combination of low-rank matrix factorization and quantization training, to reduce complexity for neural network based acoustic event detection (AED) models.

Event Detection Quantization

Compression of Acoustic Event Detection Models With Quantized Distillation

no code implementations 1 Jul 2019 Bowen Shi, Ming Sun, Chieh-Chi Kao, Viktor Rozgic, Spyros Matsoukas, Chao Wang

Acoustic Event Detection (AED), aiming at detecting categories of events based on audio signals, has found application in many intelligent systems.

Event Detection Knowledge Distillation +1

Domain wall topological entanglement entropy

no code implementations 26 Aug 2020 Bowen Shi, Isaac H. Kim

We derive a universal correction to the ground-state entanglement entropy, which is equal to the logarithm of the total quantum dimension of a set of superselection sectors localized on the domain wall.

Strongly Correlated Electrons High Energy Physics - Theory Quantum Physics

Multi-dataset Pretraining: A Unified Model for Semantic Segmentation

no code implementations 8 Jun 2021 Bowen Shi, Xiaopeng Zhang, Haohang Xu, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian

This is achieved by first pretraining the network via the proposed pixel-to-prototype contrastive loss over multiple datasets regardless of their taxonomy labels, and then fine-tuning the pretrained model on a specific dataset as usual.

Semantic Segmentation

Searching for fingerspelled content in American Sign Language

no code implementations ACL 2022 Bowen Shi, Diane Brentari, Greg Shakhnarovich, Karen Livescu

This is an important task since significant content in sign language is often conveyed via fingerspelling, and to our knowledge the task has not been studied before.

Retrieval Translation

Pose-Oriented Transformer with Uncertainty-Guided Refinement for 2D-to-3D Human Pose Estimation

no code implementations 15 Feb 2023 Han Li, Bowen Shi, Wenrui Dai, Hongwei Zheng, Botao Wang, Yu Sun, Min Guo, Chenlin Li, Junni Zou, Hongkai Xiong

There has been a recent surge of interest in introducing transformers to 3D human pose estimation (HPE) due to their powerful capabilities in modeling long-term dependencies.

3D Human Pose Estimation Position

Rethinking Visual Prompt Learning as Masked Visual Token Modeling

no code implementations 9 Mar 2023 Ning Liao, Bowen Shi, Xiaopeng Zhang, Min Cao, Junchi Yan, Qi Tian

To explore prompt learning on the generative pre-trained visual model, as well as keeping the task consistency, we propose Visual Prompt learning as masked visual Token Modeling (VPTM) to transform the downstream visual classification into the pre-trained masked visual token prediction.

Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners

no code implementations 28 Jun 2023 Bowen Shi, Xiaopeng Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian

In order to better obtain both discrimination and diversity, we propose a simple but effective Hybrid Distillation strategy, which utilizes both the supervised/CL teacher and the MIM teacher to jointly guide the student model.

Contrastive Learning Representation Learning

ActionPrompt: Action-Guided 3D Human Pose Estimation With Text and Pose Prompting

no code implementations 18 Jul 2023 Hongwei Zheng, Han Li, Bowen Shi, Wenrui Dai, Botao Wan, Yu Sun, Min Guo, Hongkai Xiong

Recent 2D-to-3D human pose estimation (HPE) utilizes temporal consistency across sequences to alleviate the depth ambiguity problem but ignores the action-related prior knowledge hidden in the pose sequence.

3D Human Pose Estimation

EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

no code implementations 10 Aug 2023 Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux

Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization).

Resynthesis Speech Synthesis

Toward American Sign Language Processing in the Real World: Data, Tasks, and Methods

no code implementations 23 Aug 2023 Bowen Shi

To address the problem of searching for fingerspelled keywords in raw sign language videos, we propose a novel method that jointly localizes and matches fingerspelling segments to text.

Hand Detection Sign Language Translation +1

Generative Pre-training for Speech with Flow Matching

no code implementations 25 Oct 2023 Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu

Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data.

Speech Enhancement Speech Synthesis +1

Audiobox: Unified Audio Generation with Natural Language Prompts

no code implementations 25 Dec 2023 Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers, Carleigh Wood, Joshua Lane, Mary Williamson, Wei-Ning Hsu

Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data.

AudioCaps Audio Generation +1

UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

no code implementations 12 Jan 2024 Bowen Shi, Peisen Zhao, Zichen Wang, Yuhang Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian, Xiaopeng Zhang

Vision-language foundation models, represented by Contrastive language-image pre-training (CLIP), have gained increasing attention for jointly understanding both vision and textual tasks.

Panoptic Segmentation Retrieval +1

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

no code implementations 21 Mar 2024 Hyojung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang

It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes.

Audio-Visual Speech Recognition Representation Learning +4
