Search Results for author: Yi-Jen Shih

Found 10 papers, 5 papers with code

Measuring Sound Symbolism in Audio-visual Models

no code implementations • 18 Sep 2024 • Wei-Cheng Tseng, Yi-Jen Shih, David Harwath, Raymond Mooney

Audio-visual pre-trained models have gained substantial attention recently and demonstrated superior performance on various audio-visual tasks.

Self-supervised Speech Models for Word-Level Stuttered Speech Detection

no code implementations • 16 Sep 2024 • Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G. Dimakis, David Harwath

In this study, we curated a stuttered speech dataset with word-level annotations and introduced a word-level stuttering speech detection model leveraging self-supervised speech models.
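
As a rough illustration of the approach described in that snippet, the sketch below pools frame-level features from a pre-trained self-supervised speech model over word spans and classifies each word. The checkpoint name, the two-class label set, and the use of a forced aligner for word boundaries are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch (not the paper's exact model): pool frame-level features from a
# pre-trained self-supervised speech model over word spans, then classify each word.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoFeatureExtractor

model_name = "facebook/hubert-base-ls960"   # assumed upstream checkpoint
extractor = AutoFeatureExtractor.from_pretrained(model_name)
upstream = AutoModel.from_pretrained(model_name).eval()

classifier = nn.Linear(upstream.config.hidden_size, 2)   # stuttered vs. fluent (assumed label set)

def classify_words(waveform, sample_rate, word_spans):
    """waveform: 1-D float array; word_spans: (start_sec, end_sec) per word, e.g. from a forced aligner (assumed)."""
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        frames = upstream(**inputs).last_hidden_state[0]        # (T, hidden)
    frames_per_sec = frames.shape[0] / (len(waveform) / sample_rate)
    logits = []
    for start, end in word_spans:
        lo = int(start * frames_per_sec)
        hi = max(int(end * frames_per_sec), lo + 1)
        word_vec = frames[lo:hi].mean(dim=0)                    # mean-pool frames inside the word
        logits.append(classifier(word_vec))
    return torch.stack(logits)                                  # (num_words, 2)
```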

Interface Design for Self-Supervised Speech Models

1 code implementation • 18 Jun 2024 • Yi-Jen Shih, David Harwath

In particular, we show that a convolutional interface whose depth scales logarithmically with the depth of the upstream model consistently outperforms many other interface designs.
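
One plausible reading of that finding, sketched below under stated assumptions: stack the hidden states of all upstream layers and collapse them with stride-2 1-D convolutions applied along the layer axis, so the number of convolution blocks grows roughly as log2 of the upstream depth. The kernel size, activation, and pooling choices are placeholders, not the configuration reported in the paper.

```python
# Minimal sketch, not the paper's exact design: collapse the stack of L upstream
# layer outputs into one feature sequence with a conv stack of depth ~ log2(L).
import math
import torch
import torch.nn as nn

class ConvInterface(nn.Module):
    def __init__(self, num_layers, hidden_size):
        super().__init__()
        depth = max(1, math.ceil(math.log2(num_layers)))   # depth grows logarithmically with upstream depth
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(hidden_size, hidden_size, kernel_size=2, stride=2),
                nn.GELU(),
            )
            for _ in range(depth)
        )

    def forward(self, layer_outputs):
        # layer_outputs: (batch, num_layers, time, hidden) hidden states from every upstream layer
        b, l, t, h = layer_outputs.shape
        x = layer_outputs.permute(0, 2, 3, 1).reshape(b * t, h, l)       # convolve over the layer axis
        for block in self.blocks:
            if x.shape[-1] > 1:
                x = block(nn.functional.pad(x, (0, x.shape[-1] % 2)))    # pad so the stride-2 conv halves cleanly
        return x.mean(dim=-1).reshape(b, t, h)                           # (batch, time, hidden) for the downstream head

features = torch.randn(2, 13, 50, 768)            # e.g. 13 layer outputs of a base-size upstream model
fused = ConvInterface(num_layers=13, hidden_size=768)(features)
print(fused.shape)                                 # torch.Size([2, 50, 768])
```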

M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval

no code implementations • 2 Nov 2022 • Layne Berry, Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Hung-Yi Lee, David Harwath

This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.

Image Retrieval • Text Retrieval
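
For context on what the retrieval step itself involves, here is a minimal sketch that ranks images for each spoken query by cosine similarity in a shared embedding space. It assumes the speech and image embeddings have already been produced by the respective encoders (HuBERT- and CLIP-based in the paper), which are not reproduced here.

```python
# Minimal sketch of the retrieval step only: rank images for each spoken query by
# cosine similarity in an assumed shared embedding space (the encoders are not shown).
import torch
import torch.nn.functional as F

def retrieve(speech_emb, image_emb, k=5):
    """speech_emb: (num_queries, dim); image_emb: (num_images, dim) from the two encoders."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    sims = speech_emb @ image_emb.T                 # (num_queries, num_images) cosine similarities
    return sims.topk(k, dim=-1).indices             # indices of the top-k images per query

# toy usage with random embeddings standing in for encoder outputs
queries, gallery = torch.randn(4, 512), torch.randn(100, 512)
print(retrieve(queries, gallery, k=3))
```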

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model

1 code implementation • 3 Oct 2022 • Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-Yi Lee, David Harwath

Data-driven speech processing models usually perform well with a large amount of text supervision, but collecting transcribed speech data is costly.

Language Modelling • +1

Theme Transformer: Symbolic Music Generation with Theme-Conditioned Transformer

1 code implementation • 7 Nov 2021 • Yi-Jen Shih, Shih-Lun Wu, Frank Zalkow, Meinard Müller, Yi-Hsuan Yang

To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation.

Music Generation • Representation Learning • Sound • Multimedia • Audio and Speech Processing
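
The priming baseline that the abstract contrasts with theme conditioning can be sketched as follows: encode the user-specified sequence as a prompt and let a causal Transformer decoder sample a continuation token by token. The toy model, vocabulary size, and sampling scheme below are placeholders, not Theme Transformer itself.

```python
# Minimal sketch of the priming baseline described above (not Theme Transformer's
# theme-conditioning mechanism): feed a user-given token sequence to a causal
# decoder and autoregressively sample a continuation. Sizes are toy values.
import torch
import torch.nn as nn

class TinyMusicLM(nn.Module):
    def __init__(self, vocab_size=512, d_model=128, n_layers=2, n_heads=4, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)    # used as a decoder-only stack
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        t = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(ids.device)
        return self.head(self.decoder(x, mask=mask))              # (batch, t, vocab) next-token logits

@torch.no_grad()
def continue_from_prime(model, prime_ids, steps=32, temperature=1.0):
    """prime_ids: (1, prime_len) user-specified conditioning sequence used as the prompt."""
    ids = prime_ids
    for _ in range(steps):
        logits = model(ids)[:, -1] / temperature
        next_id = torch.multinomial(logits.softmax(-1), 1)        # sample the next symbolic-music token
        ids = torch.cat([ids, next_id], dim=1)
    return ids

model = TinyMusicLM().eval()
prime = torch.randint(0, 512, (1, 16))                            # stand-in for an encoded theme
print(continue_from_prime(model, prime).shape)                    # torch.Size([1, 48])
```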
