1 code implementation • 16 Jun 2023 • Huang Xie, Khazar Khorrami, Okko Räsänen, Tuomas Virtanen
Conversely, the results suggest that using only binary relevances defined by captioning-based audio-caption pairs is sufficient for contrastive learning.
no code implementations • 5 Jun 2023 • Khazar Khorrami, María Andrea Cruz Blandón, Tuomas Virtanen, Okko Räsänen
As a result, we find that sequential training with wav2vec 2.0 first and VGS next provides higher performance on audio-visual retrieval compared to simultaneous optimization of both learning mechanisms.
1 code implementation • 29 Sep 2021 • Khazar Khorrami, Okko Räsänen
We review the extent to which the audiovisual aspect of LLH is supported by existing computational studies.
1 code implementation • 5 Jul 2021 • Khazar Khorrami, Okko Räsänen
We compare the alignment performance using our proposed evaluation metrics to the semantic retrieval task commonly used to evaluate VGS models.
no code implementations • 24 Jun 2019 • Okko Räsänen, Khazar Khorrami
Earlier research has suggested that human infants might use statistical dependencies between speech and non-linguistic multimodal input to bootstrap their language learning before they know how to segment words from running speech.