no code implementations • 10 Oct 2022 • Kayode Olaleye, Dan Oneata, Herman Kamper
We collect and release a new single-speaker dataset of audio captions for 6k Flickr images in Yorùbá, a real low-resource language spoken in Nigeria.
1 code implementation • 2 Feb 2022 • Kayode Olaleye, Dan Oneata, Herman Kamper
Masked-based localisation gives some of the best reported localisation scores from a VGS model, with an accuracy of 57% when the system knows that a keyword occurs in an utterance and needs to predict its location.
no code implementations • 16 Jun 2021 • Kayode Olaleye, Herman Kamper
Visually grounded speech models learn from images paired with spoken captions.
no code implementations • 14 Dec 2020 • Kayode Olaleye, Benjamin van Niekerk, Herman Kamper
Of the two forms of supervision, the visually trained model performs worse than the BoW-trained model.