Search Results for author: Herman Kamper

Found 72 papers, 35 papers with code

Deep convolutional acoustic word embeddings using word-pair side information

1 code implementation · 5 Oct 2015 · Herman Kamper, Weiran Wang, Karen Livescu

Recent studies have been revisiting whole words as the basic modelling unit in speech recognition and query applications, instead of phonetic units.

speech-recognition Speech Recognition +1

Unsupervised word segmentation and lexicon discovery using acoustic word embeddings

no code implementations · 9 Mar 2016 · Herman Kamper, Aren Jansen, Sharon Goldwater

In settings where only unlabelled speech data is available, speech technology needs to be developed without transcriptions, pronunciation dictionaries, or language modelling text.

Language Acquisition Language Modelling +1

A segmental framework for fully-unsupervised large-vocabulary speech recognition

5 code implementations · 22 Jun 2016 · Herman Kamper, Aren Jansen, Sharon Goldwater

We also show that the discovered clusters can be made less speaker- and gender-specific by using an unsupervised autoencoder-like feature extractor to learn better frame-level features (prior to embedding).

Language Modelling Speech Recognition +1

Weakly supervised spoken term discovery using cross-lingual side information

no code implementations · 21 Sep 2016 · Sameer Bansal, Herman Kamper, Sharon Goldwater, Adam Lopez

Recent work on unsupervised term discovery (UTD) aims to identify and cluster repeated word-like units from audio alone.

Unsupervised neural and Bayesian models for zero-resource speech processing

no code implementations · 3 Jan 2017 · Herman Kamper

Finally, we show that the clusters discovered by the segmental Bayesian model can be made less speaker- and gender-specific by using features from the cAE instead of traditional acoustic features.

Clustering Language Modelling +1

Towards speech-to-text translation without speech recognition

no code implementations · EACL 2017 · Sameer Bansal, Herman Kamper, Adam Lopez, Sharon Goldwater

We explore the problem of translating speech to text in low-resource scenarios where neither automatic speech recognition (ASR) nor machine translation (MT) are available, but we have training data in the form of audio paired with text translations.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

An embedded segmental K-means model for unsupervised segmentation and clustering of speech

2 code implementations · 23 Mar 2017 · Herman Kamper, Karen Livescu, Sharon Goldwater

Unsupervised segmentation and clustering of unlabelled speech are core problems in zero-resource speech processing.

Bayesian Inference Clustering +2

Visually grounded learning of keyword prediction from untranscribed speech

1 code implementation · 23 Mar 2017 · Herman Kamper, Shane Settle, Gregory Shakhnarovich, Karen Livescu

In this setting of images paired with untranscribed spoken captions, we consider whether computer vision systems can be used to obtain textual labels for the speech.

Language Acquisition TAG

Query-by-Example Search with Discriminative Neural Acoustic Word Embeddings

1 code implementation · 12 Jun 2017 · Shane Settle, Keith Levin, Herman Kamper, Karen Livescu

Query-by-example search often uses dynamic time warping (DTW) for comparing queries and proposed matching segments.

Dynamic Time Warping Word Embeddings
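As context for the DTW baseline this paper compares against, here is a minimal sketch of classic dynamic time warping over two feature sequences. The function name `dtw_cost` and the query-length normalisation are illustrative choices, not taken from the paper.

```python
import numpy as np

def dtw_cost(query, segment):
    """Alignment cost between two feature sequences (frames x dims),
    using classic DTW with Euclidean frame-to-frame distances."""
    n, m = len(query), len(segment)
    # dist[i, j]: distance between query frame i and segment frame j
    dist = np.linalg.norm(query[:, None, :] - segment[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # step in the query
                acc[i, j - 1],      # step in the segment
                acc[i - 1, j - 1],  # match both
            )
    # Normalise by query length so costs are comparable across segments
    return acc[n, m] / n
```

In a query-by-example search, candidate segments would be ranked by this cost; the embedding-based approach instead compares fixed-dimensional vectors, avoiding the quadratic alignment.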

Semantic speech retrieval with a visually grounded model of untranscribed speech

2 code implementations · 5 Oct 2017 · Herman Kamper, Gregory Shakhnarovich, Karen Livescu

We introduce a newly collected data set of human semantic relevance judgements and an associated task, semantic speech retrieval, where the goal is to search for spoken utterances that are semantically relevant to a given text query.

Language Acquisition Retrieval

Low-Resource Speech-to-Text Translation

no code implementations · 24 Mar 2018 · Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater

We explore models trained on between 20 and 160 hours of data, and find that although models trained on less data have considerably lower BLEU scores, they can still predict words with relatively high precision and recall: around 50% for a model trained on 50 hours of data, versus around 60% for the full 160-hour model.

Machine Translation speech-recognition +3

Visually grounded cross-lingual keyword spotting in speech

no code implementations · 13 Jun 2018 · Herman Kamper, Michael Roth

Recent work considered how images paired with speech can be used as supervision for building speech systems when transcriptions are not available.

Keyword Spotting Visual Grounding

Fast ASR-free and almost zero-resource keyword spotting using DTW and CNNs for humanitarian monitoring

no code implementations · 25 Jun 2018 · Raghav Menon, Herman Kamper, John Quinn, Thomas Niesler

While the resulting CNN keyword spotter cannot match the performance of the DTW-based system, it substantially outperforms a CNN classifier trained only on the keywords, improving the area under the ROC curve from 0.54 to 0.64.

Dynamic Time Warping Humanitarian +2

Pre-training on high-resource speech recognition improves low-resource speech-to-text translation

1 code implementation · NAACL 2019 · Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, Sharon Goldwater

Finally, we show that the approach improves performance on a true low-resource task: pre-training on a combination of English ASR and French ASR improves Mboshi-French ST, where only 4 hours of data are available, from 3.5 to 7.1 BLEU.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Truly unsupervised acoustic word embeddings using weak top-down constraints in encoder-decoder models

2 code implementations · 1 Nov 2018 · Herman Kamper

We investigate unsupervised models that can map a variable-duration speech segment to a fixed-dimensional representation.

Word Embeddings
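A common reference point for mapping a variable-duration segment to a fixed-dimensional vector, often used as a baseline in this line of work, is uniform downsampling. The sketch below is that baseline, not the paper's encoder-decoder model; the function name `downsample_embed` is hypothetical.

```python
import numpy as np

def downsample_embed(frames, n=10):
    """Baseline acoustic word embedding: uniformly sample n frames from a
    variable-length feature sequence (frames x dims) and flatten them into
    a fixed-dimensional vector."""
    idx = np.linspace(0, len(frames) - 1, n).round().astype(int)
    return frames[idx].ravel()
```

Segments of any duration then map to vectors of the same dimensionality (here n times the frame dimension), so they can be compared with ordinary vector distances.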

Critical initialisation for deep signal propagation in noisy rectifier neural networks

1 code implementation · NeurIPS 2018 · Arnu Pretorius, Elan van Biljon, Steve Kroon, Herman Kamper

Simulations and experiments on real-world data confirm that our proposed initialisation is able to stably propagate signals in deep networks, while using an initialisation disregarding noise fails to do so.

Multimodal One-Shot Learning of Speech and Images

2 code implementations · 9 Nov 2018 · Ryan Eloff, Herman A. Engelbrecht, Herman Kamper

Imagine a robot is shown new concepts visually together with spoken tags, e.g. "milk", "eggs", "butter".

Dynamic Time Warping One-Shot Learning

Multilingual and Unsupervised Subword Modeling for Zero-Resource Languages

1 code implementation · 9 Nov 2018 · Enno Hermann, Herman Kamper, Sharon Goldwater

Here we directly compare multiple methods, including some that use only target language speech data and some that use transcribed speech from other (non-target) languages, and we evaluate using two intrinsic measures as well as on a downstream unsupervised word segmentation and clustering task.

Clustering

Semantic query-by-example speech search using visual grounding

1 code implementation · 15 Apr 2019 · Herman Kamper, Aristotelis Anastassiou, Karen Livescu

A number of recent studies have started to investigate how speech systems can be trained on untranscribed speech by leveraging accompanying images at training time.

Retrieval Semantic Retrieval +1

On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval

no code implementations · 24 Apr 2019 · Ankita Pasad, Bowen Shi, Herman Kamper, Karen Livescu

Recent work has shown that speech paired with images can be used to learn semantically meaningful speech representations even without any textual supervision.

Retrieval Visual Grounding

Cross-lingual topic prediction for speech using translations

no code implementations · 29 Aug 2019 · Sameer Bansal, Herman Kamper, Adam Lopez, Sharon Goldwater

Given a large amount of unannotated speech in a low-resource language, can we classify the speech utterances by topic?

Humanitarian Speech-to-Text Translation +1

On the expected behaviour of noise regularised deep neural networks as Gaussian processes

no code implementations · 12 Oct 2019 · Arnu Pretorius, Herman Kamper, Steve Kroon

Recent work has established the equivalence between deep neural networks and Gaussian processes (GPs), resulting in so-called neural network Gaussian processes (NNGPs).

Gaussian Processes

If dropout limits trainable depth, does critical initialisation still matter? A large-scale statistical analysis on ReLU networks

no code implementations · 13 Oct 2019 · Arnu Pretorius, Elan van Biljon, Benjamin van Niekerk, Ryan Eloff, Matthew Reynard, Steve James, Benjamin Rosman, Herman Kamper, Steve Kroon

Our results therefore suggest that, in the shallow-to-moderate depth setting, critical initialisation provides zero performance gains when compared to off-critical initialisations and that searching for off-critical initialisations that might improve training speed or generalisation, is likely to be a fruitless endeavour.

BINet: a binary inpainting network for deep patch-based image compression

1 code implementation · 11 Dec 2019 · André Nortje, Willie Brink, Herman A. Engelbrecht, Herman Kamper

We propose the Binary Inpainting Network (BINet), an autoencoder framework which incorporates binary inpainting to reinstate interdependencies between adjacent patches, for improved patch-based compression of still images.

Image Compression

Deep motion estimation for parallel inter-frame prediction in video compression

1 code implementation · 11 Dec 2019 · André Nortje, Herman A. Engelbrecht, Herman Kamper

Standard video codecs rely on optical flow to guide inter-frame prediction: pixels from reference frames are moved via motion vectors to predict target video frames.

Motion Estimation Optical Flow Estimation +1

Unsupervised feature learning for speech using correspondence and Siamese networks

no code implementations · 28 Mar 2020 · Petri-Johan Last, Herman A. Engelbrecht, Herman Kamper

Dynamic programming is then used to align the feature frames between each word pair, serving as weak top-down supervision for the two models.

Analyzing autoencoder-based acoustic word embeddings

no code implementations · 3 Apr 2020 · Yevgen Matusevych, Herman Kamper, Sharon Goldwater

To better understand the applications of AWEs in various downstream tasks and in cognitive modeling, we need to analyze the representation spaces of AWEs.

Word Embeddings

Improved acoustic word embeddings for zero-resource languages using multilingual transfer

1 code implementation · 2 Jun 2020 · Herman Kamper, Yevgen Matusevych, Sharon Goldwater

We consider three multilingual recurrent neural network (RNN) models: a classifier trained on the joint vocabularies of all training languages; a Siamese RNN trained to discriminate between same and different words from multiple languages; and a correspondence autoencoder (CAE) RNN trained to reconstruct word pairs.

speech-recognition Speech Recognition +1

Evaluating computational models of infant phonetic learning across languages

no code implementations · 6 Aug 2020 · Yevgen Matusevych, Thomas Schatz, Herman Kamper, Naomi H. Feldman, Sharon Goldwater

In the first year of life, infants' speech perception becomes attuned to the sounds of their native language.

Unsupervised vs. transfer learning for multimodal one-shot matching of speech and images

1 code implementation · 14 Aug 2020 · Leanne Nortje, Herman Kamper

Here we compare transfer learning to unsupervised models trained on unlabelled in-domain data.

Transfer Learning

A Correspondence Variational Autoencoder for Unsupervised Acoustic Word Embeddings

no code implementations · 3 Dec 2020 · Puyuan Peng, Herman Kamper, Karen Livescu

We propose a new unsupervised model for mapping a variable-duration speech segment to a fixed-dimensional representation.

Word Embeddings

Direct multimodal few-shot learning of speech and images

1 code implementation · 10 Dec 2020 · Leanne Nortje, Herman Kamper

We propose direct multimodal few-shot models that learn a shared embedding space of spoken words and images from only a few paired examples.

Few-Shot Learning Transfer Learning

Towards localisation of keywords in speech using weak supervision

no code implementations · 14 Dec 2020 · Kayode Olaleye, Benjamin van Niekerk, Herman Kamper

Of the two forms of supervision, the visually trained model performs worse than the BoW-trained model.

Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks

no code implementations · 14 Dec 2020 · Herman Kamper, Benjamin van Niekerk

We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code, thereby giving a variable-rate segmentation of the speech into discrete units.

Clustering Segmentation
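To illustrate how a frame-level code sequence yields a variable-rate segmentation, the sketch below greedily merges runs of identical codes into segments. The paper instead constrains a pretrained VQ model with dynamic programming, so this is a simplification; the function name `merge_codes` is hypothetical.

```python
def merge_codes(codes):
    """Collapse runs of identical VQ codes into (code, start, end) segments,
    giving a variable-rate segmentation of the frame sequence.
    Frame indices are half-open: [start, end)."""
    segments = []
    start = 0
    for i in range(1, len(codes) + 1):
        # Close the current segment at the end of the sequence or when the code changes
        if i == len(codes) or codes[i] != codes[start]:
            segments.append((codes[start], start, i))
            start = i
    return segments
```

For example, the code sequence [3, 3, 3, 7, 7, 1] merges into three segments covering frames 0-3, 3-5, and 5-6.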

A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings

no code implementations · 14 Dec 2020 · Lisa van Staden, Herman Kamper

We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding and a CAE to conventional MFCCs.

Representation Learning Word Embeddings

A phonetic model of non-native spoken word processing

no code implementations · EACL 2021 · Yevgen Matusevych, Herman Kamper, Thomas Schatz, Naomi H. Feldman, Sharon Goldwater

We then test the model on a spoken word processing task, showing that phonology may not be necessary to explain some of the word processing effects observed in non-native speakers.

Attribute

StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts

no code implementations · 31 May 2021 · Matthew Baas, Herman Kamper

We specifically extend the recent StarGAN-VC model by conditioning it on a speaker embedding (from a potentially unseen speaker).

Voice Conversion

Attention-Based Keyword Localisation in Speech using Visual Grounding

no code implementations · 16 Jun 2021 · Kayode Olaleye, Herman Kamper

Visually grounded speech models learn from images paired with spoken captions.

Visual Grounding

Feature learning for efficient ASR-free keyword spotting in low-resource languages

no code implementations · 13 Aug 2021 · Ewald van der Westhuizen, Herman Kamper, Raghav Menon, John Quinn, Thomas Niesler

We show that, using these features, the CNN-DTW keyword spotter performs almost as well as the DTW keyword spotter while outperforming a baseline CNN trained only on the keyword templates.

Dynamic Time Warping Humanitarian +1

Voice Conversion Can Improve ASR in Very Low-Resource Settings

no code implementations · 4 Nov 2021 · Matthew Baas, Herman Kamper

In this work we assess whether a VC system can be used cross-lingually to improve low-resource speech recognition.

Data Augmentation speech-recognition +2

Towards Learning to Speak and Hear Through Multi-Agent Communication over a Continuous Acoustic Channel

no code implementations · 4 Nov 2021 · Kevin Eloff, Okko Räsänen, Herman A. Engelbrecht, Arnu Pretorius, Herman Kamper

Multi-agent reinforcement learning has been used as an effective means to study emergent communication between agents, yet little focus has been given to continuous acoustic communication.

Language Acquisition Multi-agent Reinforcement Learning +3

Keyword localisation in untranscribed speech using visually grounded speech models

1 code implementation · 2 Feb 2022 · Kayode Olaleye, Dan Oneata, Herman Kamper

Masked-based localisation gives some of the best reported localisation scores from a VGS model, with an accuracy of 57% when the system knows that a keyword occurs in an utterance and needs to predict its location.

Keyword Spotting TAG

Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring

3 code implementations · 24 Feb 2022 · Herman Kamper

This paper instead revisits an older approach to word segmentation: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units (without influencing the lower level).

Acoustic Unit Discovery Segmentation

YFACC: A Yorùbá speech-image dataset for cross-lingual keyword localisation through visual grounding

no code implementations · 10 Oct 2022 · Kayode Olaleye, Dan Oneata, Herman Kamper

We collect and release a new single-speaker dataset of audio captions for 6k Flickr images in Yorùbá, a real low-resource language spoken in Nigeria.

Visual Grounding

GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models

1 code implementation · 11 Oct 2022 · Matthew Baas, Herman Kamper

As in the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer.

Disentanglement Generative Adversarial Network +2

Towards visually prompted keyword localisation for zero-resource spoken languages

1 code implementation · 12 Oct 2022 · Leanne Nortje, Herman Kamper

We formalise this task and call it visually prompted keyword localisation (VPKL): given an image of a keyword, detect and predict where in an utterance the keyword occurs.

TransFusion: Transcribing Speech with Multinomial Diffusion

1 code implementation · 14 Oct 2022 · Matthew Baas, Kevin Eloff, Herman Kamper

In this work we aim to see whether the benefits of diffusion models can also be realized for speech recognition.

Denoising Image Generation +3

Mitigating Catastrophic Forgetting for Few-Shot Spoken Word Classification Through Meta-Learning

1 code implementation · 22 May 2023 · Ruan van der Merwe, Herman Kamper

We consider the problem of few-shot spoken word classification in a setting where a model is incrementally introduced to new word classes.

Continual Learning Meta-Learning

Visually grounded few-shot word acquisition with fewer shots

no code implementations · 25 May 2023 · Leanne Nortje, Benjamin van Niekerk, Herman Kamper

Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images.

Voice Conversion With Just Nearest Neighbors

1 code implementation · 30 May 2023 · Matthew Baas, Benjamin van Niekerk, Herman Kamper

Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference.

 Ranked #1 on Voice Conversion on LibriSpeech test-clean (using extra training data)

Voice Conversion
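The core idea of nearest-neighbour voice conversion can be sketched as replacing each source feature frame with the mean of its nearest neighbours among target-speaker frames. The snippet below is an illustrative simplification (a real system would match self-supervised features and vocode the converted sequence back to audio); the function name `knn_convert` is hypothetical.

```python
import numpy as np

def knn_convert(source_feats, target_feats, k=4):
    """Replace each source frame with the mean of its k most similar
    target-speaker frames under cosine similarity."""
    def normalise(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    # Cosine similarity between every source frame and every target frame
    sims = normalise(source_feats) @ normalise(target_feats).T
    # Indices of the k most similar target frames per source frame
    nearest = np.argsort(-sims, axis=1)[:, :k]
    return target_feats[nearest].mean(axis=1)
```

Because the output is built entirely from target-speaker frames, it carries the target's voice characteristics while following the source's frame sequence.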

Towards hate speech detection in low-resource languages: Comparing ASR to acoustic word embeddings on Wolof and Swahili

no code implementations · 1 Jun 2023 · Christiaan Jacobs, Nathanaël Carraz Rakotonirina, Everlyn Asiko Chimoto, Bruce A. Bassett, Herman Kamper

But in an in-the-wild test on Swahili radio broadcasts with actual hate speech keywords, the AWE model (using one minute of template data) is more robust, giving similar performance to an ASR system trained on 30 hours of labelled data.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Visually grounded few-shot word learning in low-resource settings

no code implementations · 20 Jun 2023 · Leanne Nortje, Dan Oneata, Herman Kamper

We propose an approach that can work on natural word-image pairs but with fewer examples, i.e. fewer shots, and then illustrate how this approach can be applied for multimodal few-shot learning in a real low-resource language, Yorùbá.

Few-Shot Learning

Disentanglement in a GAN for Unconditional Speech Synthesis

1 code implementation · 4 Jul 2023 · Matthew Baas, Herman Kamper

We confirm that ASGAN's latent space is disentangled: we demonstrate how simple linear operations in the space can be used to perform several tasks unseen during training.

Disentanglement Generative Adversarial Network +5

Leveraging multilingual transfer for unsupervised semantic acoustic word embeddings

no code implementations · 5 Jul 2023 · Christiaan Jacobs, Herman Kamper

Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech segments that encode phonetic content so that different realisations of the same word have similar embeddings.

Word Embeddings Word Similarity

Rhythm Modeling for Voice Conversion

1 code implementation · 12 Jul 2023 · Benjamin van Niekerk, Marc-André Carbonneau, Herman Kamper

Voice conversion aims to transform source speech into a different target voice.

Voice Conversion

Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices

no code implementations · 12 Oct 2023 · Matthew Baas, Herman Kamper

Nevertheless, this shows that voice conversion models - and kNN-VC in particular - are increasingly applicable in a range of non-standard downstream tasks.

Voice Conversion

Visually Grounded Speech Models have a Mutual Exclusivity Bias

no code implementations · 20 Mar 2024 · Leanne Nortje, Dan Oneaţă, Yevgen Matusevych, Herman Kamper

To simulate prior acoustic and visual knowledge, we experiment with several initialisation strategies using pretrained speech and vision networks.

LiSTra Automatic Speech Translation: English to Lingala Case Study

no code implementations · DCLRL (LREC) 2022 · Salomon Kabongo Kabenamualu, Vukosi Marivate, Herman Kamper

In recent years there has been great interest in addressing the data scarcity of African languages and providing baseline models for different Natural Language Processing tasks (Orife et al., 2020).

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3
