We propose a soft-label sorting network, trained alongside the counting network, that sorts the given images by their crowd counts.
With the use of data augmentation and a source separation model, results show that the proposed method achieves a character error rate of less than 18% on a Mandarin polyphonic dataset for lyrics transcription, and a mean absolute error of 0.071 seconds for lyrics alignment.
Livestreaming commerce, a hybrid of e-commerce and self-media, has broadened the spectrum of traditional sales-performance determinants.
Objective: Motor Imagery (MI) serves as a crucial experimental paradigm within the realm of Brain-Computer Interfaces (BCIs), aiming to decode motor intentions from electroencephalogram (EEG) signals.
Note-level automatic music transcription is one of the most representative music information retrieval (MIR) tasks and has been studied for various instruments to understand music.
With the rapid development of cloud computing, virtual machine scheduling has become one of the most important but challenging issues for the cloud computing community, especially for practical heterogeneous request sequences.
Based on TDC, we propose the temporal dynamic concept modeling network (TDCMN) to learn an accurate and complete concept representation for efficient untrimmed video analysis.
In this paper, we propose an editing test to evaluate users' editing experience of music generation models in a systematic way.
Most of the current supervised automatic music transcription (AMT) models lack the ability to generalize.
We present and release Omnizart, a new Python library that provides a streamlined solution to automatic music transcription (AMT).
The proposed eSUSAN extracts the univalue segment assimilating nucleus from a circular kernel based on the similarity across timestamps and distinguishes corner events by the number of pixels in the nucleus area.
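The SUSAN-style test behind this idea can be sketched as follows: count the pixels inside a circular kernel whose timestamps are similar to the nucleus, and flag a corner when that count is small. The timestamp-surface representation, thresholds, and function name below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def susan_corner_score(ts, x, y, radius=3, t_sim=0.05, area_thresh=0.5):
    """SUSAN-style corner test on a timestamp surface `ts` (per-pixel time
    of the most recent event). Returns True when the assimilated (similar)
    area around the nucleus is small enough to indicate a corner."""
    h, w = ts.shape
    nucleus = ts[y, x]
    count, total = 0, 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dx * dx + dy * dy > radius * radius:
                continue  # stay inside the circular kernel
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:
                total += 1
                if abs(ts[yy, xx] - nucleus) < t_sim:
                    count += 1  # pixel assimilated into the nucleus segment
    # A small univalue segment (USAN) area suggests a corner;
    # edges give roughly half the kernel, flat regions nearly all of it.
    return count / total < area_thresh
```

On a synthetic surface where one quadrant has recent events, the quadrant's corner pixel assimilates roughly a quarter of the kernel (corner), while edge and interior pixels assimilate half or more.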
Inspired by the strong searching capability of neural architecture search (NAS) for CNNs, this paper proposes Graph Neural Architecture Search (GNAS) with a newly designed search space.
Its major difference from the traditional image style transfer problem is that the style information is provided by music rather than images.
Weakly supervised referring expression grounding (REG) aims at localizing the referential entity in an image according to a linguistic query, where the mapping between the image region (proposal) and the query is unknown in the training stage.
In this paper, we propose a novel Cascaded Partial Decoder (CPD) framework for fast and accurate salient object detection.
We propose the multi-layered cepstrum (MLC) method to estimate multiple fundamental frequencies (MF0) of a signal under challenging contamination such as high-pass filter noise.
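As an illustration of the idea of stacking cepstrum layers, here is a minimal sketch built on the standard cepstrum (inverse FFT of the log-magnitude spectrum). The rectification between layers, the layer count, and the FFT size are assumptions for illustration, not the authors' exact MLC formulation.

```python
import numpy as np

def cepstrum_layer(x, n_fft=2048):
    """One cepstrum layer: log-magnitude spectrum followed by an inverse FFT."""
    spec = np.abs(np.fft.rfft(x, n=n_fft))
    return np.fft.irfft(np.log(spec + 1e-10), n=n_fft)

def multi_layer_cepstrum(x, n_layers=3, n_fft=2048):
    """Apply the cepstrum transform repeatedly so that periodic structure
    (peaks related to the fundamental frequencies) is progressively
    emphasized over broadband contamination such as filtered noise."""
    out = x
    for _ in range(n_layers):
        out = cepstrum_layer(out, n_fft=n_fft)
        out = np.maximum(out, 0.0)  # half-wave rectification between layers (assumed)
    return out
```

Candidate F0s would then be read off from peak positions (quefrencies) in the final layer.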
Our experiments on both vocal melody extraction and general melody extraction validate the effectiveness of the proposed model.
The patch-based convolutional neural network (CNN) model presented in this paper for vocal melody extraction in polyphonic music is inspired by object detection in image processing.
This paper presents a new approach in understanding how deep neural networks (DNNs) work by applying homomorphic signal processing techniques.