Search Results for author: Tuomas Virtanen

Found 73 papers, 34 papers with code

Reference Channel Selection by Multi-Channel Masking for End-to-End Multi-Channel Speech Enhancement

no code implementations 5 Jun 2024 Wang Dai, Xiaofei Li, Archontis Politis, Tuomas Virtanen

The experimental results on the Spear challenge simulated dataset D4 demonstrate the superiority of our proposed method over the conventional approach of using a fixed reference channel with single-channel masking.

Speech Enhancement

Speaker Distance Estimation in Enclosures from Single-Channel Audio

1 code implementation 26 Mar 2024 Michael Neri, Archontis Politis, Daniel Krause, Marco Carli, Tuomas Virtanen

Distance estimation from audio plays a crucial role in various applications, such as acoustic scene analysis, sound source localization, and room modeling.

From Weak to Strong Sound Event Labels using Adaptive Change-Point Detection and Active Learning

1 code implementation 13 Mar 2024 John Martinsson, Olof Mogren, Maria Sandsten, Tuomas Virtanen

In this work we propose an audio recording segmentation method based on an adaptive change point detection (A-CPD) for machine guided weak label annotation of audio recording segments.

Active Learning Change Point Detection

Neural Ambisonics encoding for compact irregular microphone arrays

no code implementations 11 Jan 2024 Mikko Heikkinen, Archontis Politis, Tuomas Virtanen

Ambisonics encoding of microphone array signals can enable various spatial audio applications, such as virtual reality or telepresence, but it is typically designed for uniformly-spaced spherical microphone arrays.

Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios

no code implementations 17 Dec 2023 Yuzhu Wang, Archontis Politis, Tuomas Virtanen

The clean speech clips from WSJ0 are employed for simulating speech signals of moving speakers in a reverberant environment.

Speech Enhancement

Representation Learning for Audio Privacy Preservation using Source Separation and Robust Adversarial Learning

no code implementations 9 Aug 2023 Diep Luong, Minh Tran, Shayan Gharib, Konstantinos Drossos, Tuomas Virtanen

Privacy preservation has long been a concern in smart acoustic monitoring systems, where speech can be passively recorded along with a target signal in the system's operating environment.

Privacy Preserving Representation Learning

Crowdsourcing and Evaluating Text-Based Audio Retrieval Relevances

1 code implementation 16 Jun 2023 Huang Xie, Khazar Khorrami, Okko Räsänen, Tuomas Virtanen

Conversely, the results suggest that using only binary relevances defined by captioning-based audio-caption pairs is sufficient for contrastive learning.

Audio captioning Contrastive Learning +1

STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

1 code implementation NeurIPS 2023 Kazuki Shimada, Archontis Politis, Parthasaarathy Sudarsanam, Daniel Krause, Kengo Uchida, Sharath Adavanne, Aapo Hakala, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Tuomas Virtanen, Yuki Mitsufuji

While direction of arrival (DOA) of sound events is generally estimated from multichannel audio data recorded in a microphone array, sound events usually derive from visually perceptible source objects, e.g., sounds of footsteps come from the feet of a walker.

Sound Event Localization and Detection

Permutation Invariant Recurrent Neural Networks for Sound Source Tracking Applications

no code implementations 14 Jun 2023 David Diaz-Guerra, Archontis Politis, Antonio Miguel, Jose R. Beltran, Tuomas Virtanen

Conventional recurrent neural networks (RNNs), such as the long short-term memories (LSTMs) or the gated recurrent units (GRUs), take a vector as their input and use another vector to store their state.

Simultaneous or Sequential Training? How Speech Representations Cooperate in a Multi-Task Self-Supervised Learning System

no code implementations 5 Jun 2023 Khazar Khorrami, María Andrea Cruz Blandón, Tuomas Virtanen, Okko Räsänen

As a result, we find that sequential training with wav2vec 2.0 first and VGS next provides higher performance on audio-visual retrieval compared to simultaneous optimization of both learning mechanisms.

Multi-Task Learning Representation Learning +4

Attention-Based Methods For Audio Question Answering

no code implementations 31 May 2023 Parthasaarathy Sudarsanam, Tuomas Virtanen

On the yes/no binary classification task, our proposed model achieves an accuracy of 68.3% compared to 62.7% in the reference model.

Audio Question Answering Binary Classification +1

Adversarial Representation Learning for Robust Privacy Preservation in Audio

1 code implementation 29 Apr 2023 Shayan Gharib, Minh Tran, Diep Luong, Konstantinos Drossos, Tuomas Virtanen

In this study, we propose a novel adversarial training method for learning representations of audio recordings that effectively prevents the detection of speech activity from the latent features of the recordings.

Event Detection Representation Learning +1

Multi-Channel Masking with Learnable Filterbank for Sound Source Separation

no code implementations 14 Mar 2023 Wang Dai, Archontis Politis, Tuomas Virtanen

Specifically, each mask is used to multiply the corresponding channel's 2D representation, and the masked outputs of all channels are then summed.
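
The multiply-then-sum step described in this snippet can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `apply_multichannel_masks`, the plain nested-list representation, and the toy values are all assumptions made for illustration, and the learnable filterbank that produces the 2D representations is left out.

```python
def apply_multichannel_masks(channels, masks):
    """Multiply each channel's 2D representation by its own mask, then sum over channels.

    channels, masks: lists of C matrices (lists of rows), all with the same shape.
    Hypothetical helper for illustration only.
    """
    if len(channels) != len(masks):
        raise ValueError("need exactly one mask per channel")
    rows, cols = len(channels[0]), len(channels[0][0])
    out = [[0.0] * cols for _ in range(rows)]
    for rep, mask in zip(channels, masks):
        for i in range(rows):
            for j in range(cols):
                # Element-wise masking of this channel, accumulated into the sum.
                out[i][j] += rep[i][j] * mask[i][j]
    return out


# Two made-up channels, each a 2 x 3 (features x frames) representation.
channels = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[2.0, 2.0, 2.0], [1.0, 1.0, 1.0]],
]
masks = [
    [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]],
    [[0.5, 0.5, 0.5], [1.0, 0.0, 0.0]],
]
separated = apply_multichannel_masks(channels, masks)
print(separated)  # [[2.0, 1.0, 2.5], [1.0, 5.0, 3.0]]
```

Because every channel gets its own mask, the summed output can weight channels differently at each time-frequency position, which a single shared mask cannot do.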

On Negative Sampling for Contrastive Audio-Text Retrieval

no code implementations 8 Nov 2022 Huang Xie, Okko Räsänen, Tuomas Virtanen

With a constant training setting on the retrieval system from [1], we study eight sampling strategies, including hard and semi-hard negative sampling.

Audio to Text Retrieval Contrastive Learning +1

Position tracking of a varying number of sound sources with sliding permutation invariant training

no code implementations 26 Oct 2022 David Diaz-Guerra, Archontis Politis, Tuomas Virtanen

Recent data- and learning-based sound source localization (SSL) methods have shown strong performance in challenging acoustic scenarios.

Position

Language-based Audio Retrieval Task in DCASE 2022 Challenge

no code implementations 20 Sep 2022 Huang Xie, Samuel Lipping, Tuomas Virtanen

Language-based audio retrieval is a task, where natural language textual captions are used as queries to retrieve audio signals from a dataset.

Audio captioning Retrieval

Domestic Activity Clustering from Audio via Depthwise Separable Convolutional Autoencoder Network

1 code implementation 4 Aug 2022 Yanxiong Li, Wenchang Cao, Konstantinos Drossos, Tuomas Virtanen

Automatic estimation of domestic activities from audio can be used to solve many problems, such as reducing the labor cost for nursing the elderly people.

Clustering

Language-based Audio Retrieval Task in DCASE 2022 Challenge

1 code implementation 13 Jun 2022 Huang Xie, Samuel Lipping, Tuomas Virtanen

Language-based audio retrieval is a task, where natural language textual captions are used as queries to retrieve audio signals from a dataset.

Audio captioning Retrieval

Zero-Shot Audio Classification using Image Embeddings

no code implementations 10 Jun 2022 Duygu Dogan, Huang Xie, Toni Heittola, Tuomas Virtanen

The results show that the classification performance is highly sensitive to the semantic relation between test and training classes, and that textual and image embeddings can reach up to the semantic acoustic embeddings when the seen and unseen classes are semantically similar.

Audio Classification Zero-shot Audio Classification +1

Low-complexity acoustic scene classification in DCASE 2022 Challenge

no code implementations 8 Jun 2022 Irene Martín-Morató, Francesco Paissan, Alberto Ancilotto, Toni Heittola, Annamaria Mesaros, Elisabetta Farella, Alessio Brutti, Tuomas Virtanen

The provided baseline system is a convolutional neural network which employs post-training quantization of parameters, resulting in 46.5 K parameters and 29.23 million multiply-and-accumulate operations (MMACs).

Acoustic Scene Classification Classification +2

STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

2 code implementations 4 Jun 2022 Archontis Politis, Kazuki Shimada, Parthasaarathy Sudarsanam, Sharath Adavanne, Daniel Krause, Yuichiro Koyama, Naoya Takahashi, Shusuke Takahashi, Yuki Mitsufuji, Tuomas Virtanen

Additionally, the report presents the baseline system that accompanies the dataset in the challenge with emphasis on the differences with the baseline of the previous iterations; namely, introduction of the multi-ACCDOA representation to handle multiple simultaneous occurrences of events of the same class, and support for additional improved input features for the microphone array format.

Sound Event Localization and Detection

Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering

no code implementations 20 Apr 2022 Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos, Tuomas Virtanen

Audio question answering (AQA) is a multimodal translation task where a system analyzes an audio signal and a natural language question, to generate a desirable natural language answer.

Audio Question Answering Question Answering

Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers

2 code implementations 29 Oct 2021 Sharath Adavanne, Archontis Politis, Tuomas Virtanen

Data-based and learning-based sound source localization (SSL) has shown promising results in challenging conditions, and is commonly set as a classification or a regression problem.

Classification Direction of Arrival Estimation +2

Unsupervised Audio-Caption Aligning Learns Correspondences between Individual Sound Events and Textual Phrases

1 code implementation 6 Oct 2021 Huang Xie, Okko Räsänen, Konstantinos Drossos, Tuomas Virtanen

We investigate unsupervised learning of correspondences between sound events and textual phrases through aligning audio clips with textual captions describing the content of a whole audio clip.

Event Detection Retrieval +1

Sound Event Detection: A Tutorial

1 code implementation 12 Jul 2021 Annamaria Mesaros, Toni Heittola, Tuomas Virtanen, Mark D. Plumbley

The goal of automatic sound event detection (SED) methods is to recognize what is happening in an audio signal and when it is happening.

BIG-bench Machine Learning Event Detection +1

Mobile Microphone Array Speech Detection and Localization in Diverse Everyday Environments

no code implementations 28 Jun 2021 Pasi Pertilä, Emre Cakir, Aapo Hakala, Eemi Fagerlund, Tuomas Virtanen, Archontis Politis, Antti Eronen

Joint sound event localization and detection (SELD) is an integral part of developing context awareness into communication interfaces of mobile robots, smartphones, and home assistants.

Sound Event Localization and Detection

Deep neural network Based Low-latency Speech Separation with Asymmetric analysis-Synthesis Window Pair

no code implementations 22 Jun 2021 Shanshan Wang, Gaurav Naithani, Archontis Politis, Tuomas Virtanen

Time-frequency masking or spectrum prediction computed via short symmetric windows are commonly used in low-latency deep neural network (DNN) based source separation.

Clustering Deep Clustering +2

Low-complexity acoustic scene classification for multi-device audio: analysis of DCASE 2021 Challenge systems

1 code implementation 28 May 2021 Irene Martín-Morató, Toni Heittola, Annamaria Mesaros, Tuomas Virtanen

The most used techniques among the submissions were residual networks and weight quantization, with the top systems reaching over 70% accuracy and log loss under 0.8.

Acoustic Scene Classification Quantization +1

Zero-Shot Audio Classification with Factored Linear and Nonlinear Acoustic-Semantic Projections

no code implementations 25 Nov 2020 Huang Xie, Okko Räsänen, Tuomas Virtanen

In this paper, we study zero-shot learning in audio classification through factored linear and nonlinear acoustic-semantic projections between audio instances and sound classes.

Audio Classification General Classification +2

Zero-Shot Audio Classification via Semantic Embeddings

no code implementations 24 Nov 2020 Huang Xie, Tuomas Virtanen

The experimental results show that classification performance is significantly improved by involving sound classes that are semantically close to the test classes in training.

Audio Classification General Classification +4

Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags

1 code implementation 27 Oct 2020 Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism.

cross-modal alignment Representation Learning +2

Neural Network-based Acoustic Vehicle Counting

no code implementations 22 Oct 2020 Slobodan Djukanović, Yash Patel, Jiří Matas, Tuomas Virtanen

This distance is predicted from audio using a two-stage (coarse-fine) regression, with both stages realised via neural networks (NNs).

Distance regression regression

Robust Audio-Based Vehicle Counting in Low-to-Moderate Traffic Flow

no code implementations 22 Oct 2020 Slobodan Djukanović, Jiří Matas, Tuomas Virtanen

The method is trained and tested on a traffic-monitoring dataset comprising 422 short, 20-second one-channel sound files with a total of 1421 vehicles passing by the microphone.

regression

WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

1 code implementation 21 Oct 2020 An Tran, Konstantinos Drossos, Tuomas Virtanen

Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e., a caption) of its contents.

Audio captioning Decoder +3

Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

4 code implementations 6 Sep 2020 Archontis Politis, Annamaria Mesaros, Sharath Adavanne, Toni Heittola, Tuomas Virtanen

A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions in an unlabeled subset.

Data Augmentation Sound Event Localization and Detection

Multi-task Regularization Based on Infrequent Classes for Audio Captioning

1 code implementation 9 Jul 2020 Emre Çakır, Konstantinos Drossos, Tuomas Virtanen

Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio.

Audio captioning Decoder

Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

1 code implementation 6 Jul 2020 Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen

In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence.

Audio captioning
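
The idea of shortening the audio input sequence can be sketched as follows. This is only a minimal illustration under stated assumptions: the helper name `temporal_subsample`, the choice of keeping every `factor`-th frame, and the toy feature values are hypothetical, and the snippet above does not say where in the model or with what factor the authors apply sub-sampling.

```python
def temporal_subsample(frames, factor):
    """Keep every `factor`-th frame of a feature sequence (time is the first axis).

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return frames[::factor]


# Ten made-up feature frames with two features each.
frames = [[float(t), 0.5 * t] for t in range(10)]
shorter = temporal_subsample(frames, 2)
print(len(frames), len(shorter))  # 10 5
```

An audio feature sequence is typically much longer than the caption that describes it, so dropping frames brings the two lengths closer before the decoder attends to the audio.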

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

2 code implementations 15 Jun 2020 Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

Audio representation learning based on deep neural networks (DNNs) emerged as an alternative approach to hand-crafted features.

Representation Learning

A Dataset of Reverberant Spatial Sound Scenes with Moving Sources for Sound Event Localization and Detection

2 code implementations 2 Jun 2020 Archontis Politis, Sharath Adavanne, Tuomas Virtanen

This report presents the dataset and the evaluation setup of the Sound Event Localization & Detection (SELD) task for the DCASE 2020 Challenge.

Sound Event Localization and Detection

Active Learning for Sound Event Detection

no code implementations 12 Feb 2020 Shuyang Zhao, Toni Heittola, Tuomas Virtanen

Training with recordings as context outperforms training with only annotated segments.

Active Learning Change Point Detection +2

Sound Event Detection with Depthwise Separable and Dilated Convolutions

1 code implementation 2 Feb 2020 Konstantinos Drossos, Stylianos I. Mimilakis, Shayan Gharib, Yanxiong Li, Tuomas Virtanen

The number of channels of the CNNs and the size of the weight matrices of the RNNs have a direct effect on the total number of parameters of the SED method, which amounts to a couple of million.

Event Detection Sound Event Detection

Memory Requirement Reduction of Deep Neural Networks Using Low-bit Quantization of Parameters

no code implementations 1 Nov 2019 Niccoló Nicodemo, Gaurav Naithani, Konstantinos Drossos, Tuomas Virtanen, Roberto Saletti

The application of the low-bit quantization allows a 50% reduction of the DNN memory footprint while the STOI performance drops only by 2.7%.

Quantization Speech Enhancement

Clotho: An Audio Captioning Dataset

7 code implementations 21 Oct 2019 Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen

Audio captioning is the novel task of general audio content description using free text.

Audio captioning Diversity +1

Crowdsourcing a Dataset of Audio Captions

1 code implementation 22 Jul 2019 Samuel Lipping, Konstantinos Drossos, Tuomas Virtanen

In this paper we present a three-step framework for crowdsourcing an audio captioning dataset, based on concepts and practices followed for the creation of widely used image captioning and machine translation datasets.

Sound Audio and Speech Processing

A multi-room reverberant dataset for sound event localization and detection

3 code implementations 21 May 2019 Sharath Adavanne, Archontis Politis, Tuomas Virtanen

This paper presents the sound event localization and detection (SELD) task setup for the DCASE 2019 challenge.

Sound Audio and Speech Processing

Zero-Shot Audio Classification Based on Class Label Embeddings

no code implementations 6 May 2019 Huang Xie, Tuomas Virtanen

We treat textual labels as semantic side information of audio classes, and use Word2Vec to generate class label embeddings.

Audio Classification General Classification +2

Deep Learning for Audio Signal Processing

1 code implementation 30 Apr 2019 Hendrik Purwins, Bo Li, Tuomas Virtanen, Jan Schlüter, Shuo-Yiin Chang, Tara Sainath

Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing.

Audio Signal Processing Automatic Speech Recognition +5

Localization, Detection and Tracking of Multiple Moving Sound Sources with a Convolutional Recurrent Neural Network

1 code implementation 29 Apr 2019 Sharath Adavanne, Archontis Politis, Tuomas Virtanen

This paper investigates the joint localization, detection, and tracking of sound events using a convolutional recurrent neural network (CRNN).

Unsupervised Adversarial Domain Adaptation Based On The Wasserstein Distance For Acoustic Scene Classification

1 code implementation 24 Apr 2019 Konstantinos Drossos, Paul Magron, Tuomas Virtanen

A challenging problem in the deep learning-based machine listening field is the degradation of performance when using data from unseen conditions.

Acoustic Scene Classification Classification +3

Unsupervised adversarial domain adaptation for acoustic scene classification

1 code implementation 17 Aug 2018 Shayan Gharib, Konstantinos Drossos, Emre Çakır, Dmitriy Serdyuk, Tuomas Virtanen

A general problem in the acoustic scene classification task is the mismatch in conditions between training and testing data, which significantly reduces the classification accuracy of the developed methods.

Acoustic Scene Classification Classification +3

Acoustic Scene Classification: A Competition Review

no code implementations 2 Aug 2018 Shayan Gharib, Honain Derrar, Daisuke Niizumi, Tuukka Senttula, Janne Tommola, Toni Heittola, Tuomas Virtanen, Heikki Huttunen

In this paper we study the problem of acoustic scene classification, i.e., categorization of audio sequences into mutually exclusive classes based on their spectral content.

Acoustic Scene Classification Classification +2

A multi-device dataset for urban acoustic scene classification

2 code implementations 25 Jul 2018 Annamaria Mesaros, Toni Heittola, Tuomas Virtanen

This paper introduces the acoustic scene classification task of DCASE 2018 Challenge and the TUT Urban Acoustic Scenes 2018 dataset provided for the task, and evaluates the performance of a baseline system in the task.

Acoustic Scene Classification Classification +1

Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

8 code implementations 30 Jun 2018 Sharath Adavanne, Archontis Politis, Joonas Nikunen, Tuomas Virtanen

In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space.

Sound Audio and Speech Processing

End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input

no code implementations 9 May 2018 Emre Çakır, Tuomas Virtanen

Sound event detection systems typically consist of two stages: extracting hand-crafted features from the raw audio waveform, and learning a mapping between these features and the target sound events using a classifier.

Event Detection Sound Event Detection

MaD TwinNet: Masker-Denoiser Architecture with Twin Networks for Monaural Sound Source Separation

2 code implementations 1 Feb 2018 Konstantinos Drossos, Stylianos Ioannis Mimilakis, Dmitriy Serdyuk, Gerald Schuller, Tuomas Virtanen, Yoshua Bengio

Current state of the art (SOTA) results in monaural singing voice separation are obtained with deep learning based methods.

Sound Audio and Speech Processing

Multichannel Sound Event Detection Using 3D Convolutional Neural Networks for Learning Inter-channel Features

no code implementations 29 Jan 2018 Sharath Adavanne, Archontis Politis, Tuomas Virtanen

Each of these datasets has four-channel first-order Ambisonic, binaural, and single-channel versions, on which the performance of SED using the proposed method is compared to study the potential of SED using multichannel audio.

Event Detection Sound Event Detection

Automated Audio Captioning with Recurrent Neural Networks

no code implementations 30 Jun 2017 Konstantinos Drossos, Sharath Adavanne, Tuomas Virtanen

The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder.

Audio captioning Decoder +4

Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features

no code implementations 7 Jun 2017 Sharath Adavanne, Giambattista Parascandolo, Pasi Pertilä, Toni Heittola, Tuomas Virtanen

In this paper, we propose the use of spatial and harmonic features in combination with a long short-term memory (LSTM) recurrent neural network (RNN) for the automatic sound event detection (SED) task.

Event Detection Sound Event Detection

Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection

no code implementations 7 Jun 2017 Sharath Adavanne, Konstantinos Drossos, Emre Çakır, Tuomas Virtanen

This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks.

Bird Audio Detection Data Augmentation +1

Convolutional Recurrent Neural Networks for Bird Audio Detection

no code implementations 7 Mar 2017 Emre Çakır, Sharath Adavanne, Giambattista Parascandolo, Konstantinos Drossos, Tuomas Virtanen

Bird sounds possess distinctive spectral structure which may exhibit small shifts in spectrum depending on the bird species and environmental conditions.

Bird Audio Detection

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

1 code implementation 21 Feb 2017 Emre Çakır, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, Tuomas Virtanen

Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure.

Event Detection Sound Event Detection

Recurrent Neural Networks for Polyphonic Sound Event Detection in Real Life Recordings

2 code implementations 4 Apr 2016 Giambattista Parascandolo, Heikki Huttunen, Tuomas Virtanen

In this paper we present an approach to polyphonic sound event detection in real life recordings based on bi-directional long short term memory (BLSTM) recurrent neural networks (RNNs).

Data Augmentation Event Detection +1
