Search Results for author: Puyuan Peng

Found 18 papers, 10 papers with code

SyllableLM: Learning Coarse Semantic Units for Speech Language Models

1 code implementation | 5 Oct 2024 | Alan Baade, Puyuan Peng, David Harwath

For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge as speech-based language models have had to use several times more tokens per word than text-based language models.

Clustering, Language Modelling, +1

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

1 code implementation | 25 Mar 2024 | Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath

We introduce VoiceCraft, a token infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.

Decoder, Language Modelling, +1

BAT: Learning to Reason about Spatial Sounds with Large Language Models

no code implementations | 2 Feb 2024 | Zhisheng Zheng, Puyuan Peng, Ziyang Ma, Xie Chen, Eunsol Choi, David Harwath

By integrating Spatial-AST with the LLaMA-2 7B model, BAT transcends standard Sound Event Localization and Detection (SELD) tasks, enabling the model to reason about the relationships between the sounds in its environment.

Event Detection, Language Modelling, +5

Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos

no code implementations | 27 Jun 2023 | Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

This paper introduces a method for robot action sequence generation from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech to a sequence of robot actions called dynamic movement primitives (DMPs) and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data.

Multi-Task Learning, Scene Understanding, +3

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

2 code implementations | 19 May 2023 | Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath

In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.

Language Modelling, Masked Language Modeling, +3

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

1 code implementation | 18 May 2023 | Puyuan Peng, Brian Yan, Shinji Watanabe, David Harwath

We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.

Audio-Visual Speech Recognition, Prompt Engineering, +2

Zero-shot Video Moment Retrieval With Off-the-Shelf Models

no code implementations | 3 Nov 2022 | Anuj Diwan, Puyuan Peng, Raymond J. Mooney

For the majority of the machine learning community, the expensive nature of collecting high-quality human-annotated data and the inability to efficiently finetune very large state-of-the-art pretrained models on limited compute are major bottlenecks for building models for new tasks.

Moment Retrieval, Retrieval

MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

2 code implementations | 30 Mar 2022 | Alan Baade, Puyuan Peng, David Harwath

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification.

Audio Classification, Decoder

Fast-Slow Transformer for Visually Grounding Speech

1 code implementation | 16 Sep 2021 | Puyuan Peng, David Harwath

We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-VGS.

Image Retrieval, Retrieval

A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings

no code implementations | 3 Dec 2020 | Puyuan Peng, Herman Kamper, Karen Livescu

We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.

Word Embeddings
