Precise question understanding is critical for temporal reading comprehension.
The Mixed Error Rate results show that as little as $1\sim10$ hours of adaptation data can suffice to saturate the performance gain (SEAME), while the ASRU task continued to show performance gains with more adaptation data ($>$100 hours).
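For readers unfamiliar with the metric, the following is a minimal sketch of how a mixed error rate is commonly computed for Mandarin-English code-switching speech. It assumes that each Mandarin character and each English word counts as one token; the snippet above does not give the exact definition used in the paper, and the function names `mixed_tokens`, `edit_distance`, and `mixed_error_rate` are hypothetical.

```python
import re


def mixed_tokens(text):
    """Split text so each CJK character and each Latin word is one token.

    Assumption: tokenization follows the common character/word convention
    for Mandarin-English code-switching; the source does not specify it.
    """
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9']+", text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def mixed_error_rate(ref_text, hyp_text):
    """Token-level edit distance divided by the number of reference tokens."""
    ref = mixed_tokens(ref_text)
    hyp = mixed_tokens(hyp_text)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)
```

With the reference "我 想 去 shopping mall" and the hypothesis "我 想 shopping more", the sketch counts one deletion and one substitution over five reference tokens.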
This paper demonstrates that an attacker can extract speaker information by querying speaker-adapted speech recognition (ASR) systems.
Current speaker anonymization methods, especially with self-supervised learning (SSL) models, require massive computational resources when hiding speaker identity.
By combining the prompt and input image, a large vision-language model (i.e., InstructBLIP) generates detailed and comprehensive descriptions of the environment and identifies potential risks in the environment by analyzing the environmental objects and scenes relevant to the prompt.
Out-of-distribution (OOD) detection aims to detect "unknown" data whose labels have not been seen during the in-distribution (ID) training process.
Stable Diffusion (SD) customization approaches enable users to personalize SD model outputs, greatly enhancing the flexibility and diversity of AI art.
Our experiments on multiple datasets demonstrate the effectiveness of SE-Bridge in SE.
First, STDE introduces target videos as patch textures and only adds patches on keyframes that are adaptively selected by temporal difference.
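The keyframe selection idea can be illustrated with a short sketch. It assumes that "temporal difference" means the mean absolute pixel difference between consecutive frames and that the frames with the largest differences are chosen; the snippet above does not state the actual criterion, and the names `temporal_differences` and `select_keyframes` as well as the parameter `k` are hypothetical.

```python
def temporal_differences(frames):
    """Mean absolute difference between each frame and its predecessor.

    Each frame is a flat list of pixel intensities of equal length.
    The first frame has no predecessor and gets a difference of 0.0.
    """
    diffs = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        diffs.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return diffs


def select_keyframes(frames, k):
    """Return the indices (in temporal order) of the k frames that change most.

    Assumption: a top-k rule stands in for the adaptive selection mentioned
    in the text, whose exact form is not given.
    """
    diffs = temporal_differences(frames)
    ranked = sorted(range(len(frames)), key=lambda i: diffs[i], reverse=True)
    return sorted(ranked[:k])


# Toy clip of six two-pixel frames: large changes occur at frames 2 and 4.
clip = [[0, 0], [0, 0], [10, 10], [10, 10], [0, 0], [1, 1]]
keyframes = select_keyframes(clip, 2)
```

Only the frames listed in `keyframes` would then receive a patch, which is the sparsity the sentence above describes.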
In this paper, we propose a novel fashion image retrieval method leveraging both global and fine-grained features, dubbed Multi-Granular Alignment (MGA).
Ranked #3 on Metric Learning on In-Shop
Then, we analogize patch optimization with regular model optimization, proposing a series of self-ensemble approaches on the input data, the attacked model, and the adversarial patch to efficiently make use of the limited information and prevent the patch from overfitting.
For a state-of-the-art end-to-end ASR model to gain data efficiency and to exploit much more unpaired text data through multi-modal training, two problems must be addressed: 1) the synchronicity of feature sampling rates between speech and language (i.e., text data); 2) the homogeneity of the representations learned by the two encoders.
Intermediate layer output (ILO) regularization by means of multitask training on the encoder side has been shown to be an effective approach to yielding improved results on a wide range of end-to-end ASR frameworks.
More importantly, we train an end-to-end (E2E) speech recognition model by means of merging two monolingual data sets and observe the efficacy of the proposed ILME-based LM fusion for CSSR.
In this paper, we propose a model that combines complex spectrogram-domain features and time-domain features through a cross-domain encoder (CDE) and adopts a hierarchic temporal convolutional network (HTCN) for multiple music source separation.
Ranked #8 on Music Source Separation on MUSDB18
3 code implementations • 13 Jun 2022 • Luca Gagliardi, Andrea Raffo, Ulderico Fugacci, Silvia Biasotti, Walter Rocchia, Hao Huang, Boulbaba Ben Amor, Yi Fang, Yuanyuan Zhang, Xiao Wang, Charles Christoffer, Daisuke Kihara, Apostolos Axenopoulos, Stelios Mylonas, Petros Daras
This paper presents the methods that have participated in the SHREC 2022 contest on protein-ligand binding site recognition.
Low-resource speech recognition has long suffered from insufficient training data.
The performance of current Scene Graph Generation models is severely hampered by some hard-to-distinguish predicates, e.g., "woman-on/standing on/walking on-beach" or "woman-near/looking at/in front of-child".
Deformable image registration plays a critical role in various tasks of medical image analysis.
To boost the performance of PMT, we propose a multi-modeling unit training (MMUT) architecture fused with PMT (PM-MMUT).
Requirements driven search-based testing (also known as falsification) has proven to be a practical and effective method for discovering erroneous behaviors in Cyber-Physical Systems.
A non-autoregressive end-to-end ASR framework may be well suited to the code-switching recognition task thanks to its inherent property that the present output token is independent of historical ones.
In contrast to the PGD-k attack, our method generates adversarial samples that keep the geometric features in clean samples and contain few outliers.
Procedural text understanding aims at tracking the states (e.g., create, move, destroy) and locations of the entities mentioned in a given paragraph.
In this work, we introduce a joint geometric-neural networks approach for comparing, deforming and generating 3D protein structures.
Our ResNet-TW (Deep Residual Network for Time Warping) tackles the alignment problem by compositing a flow of incremental diffeomorphic mappings.
In this paper, we propose a single multi-task learning framework to perform End-to-End (E2E) speech recognition (ASR) and accent recognition (AR) simultaneously.
Then, we design a two-level perturbation fusion strategy to alleviate the conflict between the adversarial watermarks generated by different facial images and models.
Firstly, we propose a patch selection and refining scheme to find the pixels which have the greatest importance for attack and remove the inconsequential perturbations gradually.
Based on each selected branch, the approach constructs the subgraph with parameters of distance and search level, while using branches' LODF metrics as the weights.
We perform multi-source data fusion for training IDS in a cyber-physical power system testbed, where we collect cyber and physical side data from multiple sensors emulating real-world data sources that would be found in a utility and synthesize these into features for algorithms to detect intrusions.
The usage and configuration of DNP3 with real-world equipment to achieve power system monitoring and control of a large-scale synthetic electric grid via DNP3 communication is presented.
This paper presents an approach to address this challenge through bio-inspired power system network design to improve system reliability and resilience against disturbances.
Power system restoration is a highly complex task that must be performed in a timely manner following a blackout.
Experimental results on an 8-accent English speech recognition task show that both methods can yield WERs close to those of conventional ASR systems that completely ignore the accent, as well as the desired AR accuracy.
In this paper, we propose a novel meta-learning based 3D point signature model, named 3Dmetapointsignature (MEPS) network, that is capable of learning robust point signatures in 3D shapes.
Many graph embedding approaches have been proposed for knowledge graph completion via link prediction.
Recent works introduce convolutional neural networks (CNNs) to extract high-level feature maps and find correspondences through feature matching.
Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre.
In this paper, we present a series of complementary approaches to improve the recognition of underrepresented named entities (NE) in hybrid ASR systems without compromising overall word error rate performance.
We design a new deep learning based framework to optimize a diffeomorphic model via multi-scale propagation in order to integrate advantages and avoid limitations of these two categories of approaches.
However, given all the historical transaction records, it is challenging to predict the sale price of the remaining seats at any future timestamp, not only because the sale price depends on many features (seat locations, date-to-event of the transaction, event date, team performance, etc.
Video action recognition, a critical problem in video understanding, has been gaining increasing attention.
The redundant features existing in high dimensional datasets always affect the performance of learning and mining algorithms.