Video instance segmentation is a challenging task that serves as the cornerstone of numerous downstream applications, including video editing and autonomous driving.
Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice using just a few seconds of recording while maintaining a high level of realism.
This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models.
However, most healthcare datasets with high-quality (gold-standard) labels are small-scale, as directly collecting ground truth is often costly and time-consuming.
Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language.
Different from these methods, in this work we propose two zero-shot ASR domain adaptation methods that require only a domain-specific text prompt, using LLaMA, a 7-billion-parameter large language model (LLM).
Recent end-to-end automatic speech recognition (ASR) systems often utilize a Transformer-based acoustic encoder that generates embeddings at a high frame rate.
Thus we propose to repetitively utilize the given visual condition in a cyclic way: we plant the visual condition as a high-concentration "seed" at the initialization step of the denoising process, and "diffuse" it into a harmonious picture by controlling a one-way information flow from the visual condition.
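As a rough illustration of this seeding idea, here is a minimal PyTorch-style sketch assuming a generic DDPM sampler; `denoise_fn`, `cond_latent`, and the mask-based re-imposition are hypothetical stand-ins, not the paper's actual interface:

```python
import torch

def seeded_init(cond_latent: torch.Tensor, alpha_bar_T: float) -> torch.Tensor:
    """Plant the visual condition as a high-concentration 'seed' in the
    initial latent, instead of starting from pure Gaussian noise."""
    noise = torch.randn_like(cond_latent)
    # Forward-diffuse the condition to step T so sampling starts from it.
    return alpha_bar_T ** 0.5 * cond_latent + (1 - alpha_bar_T) ** 0.5 * noise

def one_way_step(x_t, mask, cond_latent, denoise_fn, t, alpha_bar_prev):
    """One reverse step with a one-way information flow: the seeded region
    is re-imposed (re-noised to the current level) after each update, so
    information diffuses outward from the seed but never back into it."""
    x_prev = denoise_fn(x_t, t)  # ordinary reverse diffusion update
    noised_cond = alpha_bar_prev ** 0.5 * cond_latent \
        + (1 - alpha_bar_prev) ** 0.5 * torch.randn_like(cond_latent)
    return torch.where(mask, noised_cond, x_prev)
```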
In this report, we successfully validated the effectiveness of the decoupling strategy in video panoptic segmentation.
The efficacy of the decoupling strategy relies on two crucial elements: 1) attaining precise long-term alignment outcomes via frame-by-frame association during tracking, and 2) effectively utilizing temporal information, predicated on these accurate alignment outcomes, during refinement.
Ranked #1 on Video Panoptic Segmentation on VIPSeg
In this paper, we propose a novel compression strategy that leverages structured pruning and knowledge distillation to reduce the model size and inference cost of the Conformer model while preserving high recognition performance.
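A minimal sketch of how structured pruning and knowledge distillation can be combined in a single objective; the gate parameters, loss weights, and the `prune_and_distill_loss` name are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def prune_and_distill_loss(student_logits, teacher_logits, targets,
                           gate_params, temp=2.0, alpha=0.5, sparsity_weight=1e-4):
    """Joint objective: task loss + distillation from the unpruned teacher
    + a sparsity penalty on structured gates (one gate per prunable unit)."""
    task = F.cross_entropy(student_logits, targets)
    distill = F.kl_div(
        F.log_softmax(student_logits / temp, dim=-1),
        F.softmax(teacher_logits / temp, dim=-1),
        reduction="batchmean") * temp ** 2
    # L1 on gates as a proxy for structured pruning: gates driven to zero
    # mark whole heads/channels that can be removed after training.
    sparsity = sum(g.abs().sum() for g in gate_params)
    return alpha * task + (1 - alpha) * distill + sparsity_weight * sparsity
```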
Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities.
Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner.
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis.
We then propose to further explore the critical step in the denoising trajectory that characterizes the convergence of a pre-trained DDM.
In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt during synthesis.
Extensive experiments show that our model achieves 52.06% accuracy (versus 58.93% in the fully supervised setting) on RefCOCO+@testA when using only 1% of the mask annotations.
In conventional person re-identification (ReID), clothing clues and identity (ID)-intrinsic clues are potentially entangled in the learned representation.
Instead of sampling positive and negative links and heuristically computing the features of their enclosing subgraphs, GraphLP utilizes the feature learning ability of deep-learning models to automatically extract the structural patterns of graphs for link prediction under the assumption that real-world graphs are not locally isolated.
However, due to the limited resources of edge servers, it is difficult to meet the optimization goals of the two methods at the same time.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
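To make the tokenizer idea concrete, here is a small NumPy sketch of frame labeling with a frozen random projection and a random codebook; all sizes and names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, CODEBOOK = 128, 1024                  # hypothetical dimensions

proj = rng.standard_normal((DIM, DIM))     # frozen random projection
codebook = rng.standard_normal((CODEBOOK, DIM))

def tokenize(features: np.ndarray) -> np.ndarray:
    """Map each frame to its nearest random codebook entry; the resulting
    discrete labels supervise masked prediction without a learned tokenizer.
    `features` has shape (T, DIM); returns (T,) integer labels."""
    z = features @ proj
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(-1)
```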
Ranked #1 on Audio Classification on ESC-50 (using extra training data)
no code implementations • 7 Dec 2022 • Yinpeng Dong, Peng Chen, Senyou Deng, Lianji L, Yi Sun, Hanyu Zhao, Jiaxing Li, Yunteng Tan, Xinyu Liu, Yangyi Dong, Enhui Xu, Jincai Xu, Shu Xu, Xuelin Fu, Changfeng Sun, Haoliang Han, Xuchong Zhang, Shen Chen, Zhimin Sun, Junyi Cao, Taiping Yao, Shouhong Ding, Yu Wu, Jian Lin, Tianpeng Wu, Ye Wang, Yu Fu, Lin Feng, Kangkang Gao, Zeyu Liu, Yuanzhe Pang, Chengqi Duan, Huipeng Zhou, Yajie Wang, Yuhang Zhao, Shangbo Wu, Haoran Lyu, Zhiyu Lin, YiFei Gao, Shuang Li, Haonan Wang, Jitao Sang, Chen Ma, Junhao Zheng, Yijia Li, Chao Shen, Chenhao Lin, Zhichao Cui, Guoshuai Liu, Huafeng Shi, Kun Hu, Mengxin Zhang
The security of artificial intelligence (AI) is an important research area towards safe, reliable, and trustworthy AI systems.
Deep learning models have shown great promise in various healthcare applications.
Recent years have seen a surge of interest in self-supervised learning approaches for end-to-end speech encoding, as they have achieved great success.
This motivates us to leverage the factorized neural transducer structure, which contains a real language model: the vocabulary predictor.
Compared with a supervised baseline and the WavLM-based SS model using feature embeddings obtained with the previously released WavLM trained on 94K hours of data, our proposed model obtains relative word error rate (WER) reductions of 15.9% and 11.2%, respectively, on a simulated far-field speech mixture test set.
In this paper, we propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
RGB videos, however, are raw signals with substantial visual redundancy, leading the encoder to overlook the key information for sign language understanding.
Meetings are an essential form of communication for all types of organizations, and remote collaboration systems have been much more widely used since the COVID-19 pandemic.
4 code implementations • 12 Oct 2022 • Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, Zhiliang Peng, Yu Wu, Payal Bajaj, Saksham Singhal, Alon Benhaim, Barun Patra, Zhun Liu, Vishrav Chaudhary, Xia Song, Furu Wei
A big convergence of model architectures across language, vision, speech, and multimodal is emerging.
We perceive and communicate with the world in a multisensory manner, where different information sources are processed and interpreted in sophisticated ways by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system.
In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation.
Particularly, SiRi conveys a significant principle to the research of visual grounding, i.e., a better-initialized vision-language encoder helps the model converge to a better local minimum, advancing performance accordingly.
Online influence maximization aims to maximize the influence spread of a piece of content in a social network with an unknown network model by selecting a few seed nodes.
Recently, masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, due to their promising results and support for cross-modal synthesis.
Cardiorespiratory fitness is an established predictor of metabolic disease and mortality.
Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even if the pre-training objective is designed for speech recognition.
In this paper, an ultra-fast speech separation Transformer model is proposed to achieve both better performance and efficiency with teacher-student learning (T-S learning).
On the other hand, the parallel reference lines of parallel decomposition methods, including normal boundary intersection (NBI), might result in poor diversity because of under-sampling near the boundaries for MaOPs with concave frontiers.
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates complex musical samples conditioned on dance videos.
The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency.
This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR).
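One way to picture token-level serialization is the sketch below, which orders recognition tokens chronologically across overlapping utterances and inserts a special channel-change token at speaker switches; the `<cc>` symbol and the input format are assumptions for illustration:

```python
CC = "<cc>"  # special token marking a switch between overlapping utterances

def serialize(utterances):
    """Merge multiple token streams into one serialized stream.
    `utterances` is a list of per-speaker streams, each a list of
    (emission_time, token) pairs; output is a single token sequence with
    <cc> inserted whenever consecutive tokens come from different streams."""
    events = sorted(
        (t, uid, tok)
        for uid, stream in enumerate(utterances)
        for t, tok in stream)
    out, prev_uid = [], None
    for _, uid, tok in events:
        if prev_uid is not None and uid != prev_uid:
            out.append(CC)
        out.append(tok)
        prev_uid = uid
    return out
```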
Retrieving target videos based on text descriptions is a task of great practical value and has received increasing attention over the past few years.
In contrast, our large-scale VIdeo Panoptic Segmentation in the Wild (VIPSeg) dataset provides 3,536 videos and 84,750 frames with pixel-level panoptic annotations, covering a wide range of real-world scenarios and categories.
Existing NAS-based meta-learning methods apply a two-stage strategy, i.e., first searching architectures and then re-training meta-weights on the searched architecture.
Recently, pioneering work found that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information.
5 code implementations • 26 Oct 2021 • Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei
Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks.
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.
We integrate the proposed methods into the HuBERT framework.
In this paper we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning.
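A compact sketch of the target-switching idea: the quantized targets of the original and noisy views are swapped in the contrastive loss, so the network is rewarded for noise-invariant predictions. Here `contrastive_fn` is a placeholder for a wav2vec 2.0-style contrastive loss, not the paper's exact implementation:

```python
import torch

def switched_contrastive_loss(ctx_clean, ctx_noisy, q_clean, q_noisy,
                              contrastive_fn):
    """Swap quantized targets between the clean and noisy views so each
    contextual representation must predict the *other* view's target,
    pushing noise-invariant information into the representations."""
    loss_cn = contrastive_fn(ctx_clean, q_noisy)  # clean context -> noisy target
    loss_nc = contrastive_fn(ctx_noisy, q_clean)  # noisy context -> clean target
    return 0.5 * (loss_cn + loss_nc)
```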
ILMA enables a fast text-only adaptation of the E2E model without increasing the run-time computational cost.
Genetic barcoding coupled with single-cell sequencing technology enables direct measurement of cell-to-cell transitions and gene-expression evolution over a long timespan.
We focus on the problem of segmenting an object referred to by a natural language sentence in video content, which lies at the core of formulating a pinpoint vision-language relation.
To deal with this problem, instead of introducing knowledge base as the input, we force the model to learn a better semantic representation by predicting the information in the knowledge base, only based on the input context.
Federated Learning (FL) enables multiple participating devices to collaboratively contribute to a global neural network model while keeping the training data local.
Recently, there has been vast interest in self-supervised learning (SSL), where the model is pre-trained on large-scale unlabeled data and then fine-tuned on a small labeled dataset.
Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR).
Current vision and language tasks usually take complete visual data (e.g., raw images or videos) as input; however, practical scenarios often involve situations where part of the visual information becomes inaccessible for various reasons, e.g., a restricted view from a fixed camera or intentional vision blocking for security concerns.
To the best of our knowledge, our VSPW is the first attempt to tackle the challenging video scene parsing task in the wild by considering diverse scenarios.
In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weights tuning during inference.
To address the issue, we propose a template-based method for NER, treating NER as a language model ranking problem in a sequence-to-sequence framework, where original sentences and statement templates filled with candidate named entity spans are regarded as the source and target sequences, respectively.
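A toy sketch of template-based ranking for NER; the template wording and the `seq2seq_score` helper (assumed to return a seq2seq model's log-likelihood of a target given a source) are hypothetical:

```python
TEMPLATE = "{span} is a {label} entity ."

def best_label(sentence: str, span: str, labels: list, seq2seq_score):
    """Rank statement templates filled with a candidate span; the highest
    scoring filled template decides the span's entity type (or 'O')."""
    candidates = [TEMPLATE.format(span=span, label=lab) for lab in labels]
    candidates.append(f"{span} is not a named entity .")  # null template
    scores = [seq2seq_score(sentence, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return (list(labels) + ["O"])[best]
```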
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR).
Text-based video segmentation is a challenging task that segments out the natural language referred objects in videos.
Ranked #4 on Referring Expression Segmentation on J-HMDB (Precision@0.9 metric)
In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner.
We further improve ImagineRNN by residual anticipation, i.e., changing its target to predicting the feature difference of adjacent frames instead of the frame content.
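A minimal sketch of residual anticipation, assuming a GRU-based predictor; the module layout is illustrative, not the paper's exact architecture:

```python
import torch.nn as nn

class ResidualAnticipation(nn.Module):
    """Predict the *difference* between adjacent frame features rather than
    the next feature itself; the anticipated next feature is the current
    feature plus the predicted residual."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.rnn = nn.GRUCell(dim, hidden)
        self.head = nn.Linear(hidden, dim)

    def forward(self, feat_t, state):
        state = self.rnn(feat_t, state)
        delta = self.head(state)       # predicted change, not content
        return feat_t + delta, state   # anticipated next-frame feature
```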
Conventional approaches for formality style transfer borrow models from neural machine translation, which typically requires massive parallel data for training.
With its strong modeling capacity that comes from a multi-head and multi-layer structure, Transformer is a very powerful model for learning a sequential representation and has been successfully applied to speech separation recently.
Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition.
With rising concerns about AI systems provided with direct access to abundant sensitive information, researchers seek to develop more reliable AI with implicit information sources.
Continuous speech separation plays a vital role in complicated speech related tasks such as conversation transcription.
Ranked #1 on Speech Separation on LibriCSS (using extra training data)
We quantitatively investigate the presence of structural commonsense cues in BERT when solving commonsense tasks, and the importance of such cues for the model prediction.
The commonly used framework for unsupervised machine translation builds initial translation models of both translation directions, and then performs iterative back-translation to jointly boost their translation performance.
We present a workflow that aims at capturing residents' abnormal activities through elevator passenger flow in multi-storey residential buildings.
Among all three E2E models, transformer-AED achieved the best accuracy in both streaming and non-streaming modes.
End-to-end speech translation poses a heavy burden on the encoder, because it has to transcribe, understand, and learn cross-lingual semantics simultaneously.
Non-task-oriented dialogue systems have achieved great success in recent years due to largely accessible conversation data and the development of deep learning techniques.
The attention-based Transformer model has achieved promising results for speech recognition (SR) in the offline mode.
Due to the large action vocabulary in egocentric video datasets, recent studies usually utilize a two-branch structure for action recognition, i.e., one branch for verb classification and the other branch for noun classification.
Ranked #4 on Egocentric Activity Recognition on EGTEA
Attention-based encoder-decoder models have achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks.
Formality text style transfer plays an important role in various NLP applications, such as non-native speaker assistants and child education.
The duration of these segments is usually short, so the visual and acoustic features of each segment may not be well aligned.
This lightweight layer incorporates a simple l2 normalization, making our transformation unit applicable at the operator level without adding many parameters.
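A parameter-free l2 normalization of this kind can be written in a few lines; this PyTorch sketch is only meant to illustrate why it adds essentially no parameters:

```python
import torch

def l2_normalize(x: torch.Tensor, dim: int = -1, eps: float = 1e-12):
    """Rescale each feature vector to unit length. There are no learnable
    weights, so wrapping individual operators with this unit leaves the
    parameter count essentially unchanged."""
    return x / x.norm(p=2, dim=dim, keepdim=True).clamp_min(eps)
```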
End-to-end speech translation, a hot topic in recent years, aims to translate a segment of audio into a specific language with an end-to-end model.
Pre-training has proven to be effective in unsupervised machine translation due to its ability to model deep context information in cross-lingual scenarios.
By this novel cascaded captioning-revising mechanism, CRN can accurately describe images with unseen objects.
In this report, we present the Baidu-UTS submission to the EPIC-Kitchens Action Recognition Challenge in CVPR 2019.
In this paper, we empirically study this problem and introduce 1) a simple yet effective baseline that achieves promising performance; and 2) an easier and practical setting for EmbodiedQA where an agent has a chance to adapt the trained model to a new environment before it actually answers users' questions.
We propose to automatically search for a CNN architecture that is specifically suitable for the reID task.
Ranked #9 on Person Re-Identification on CUHK03 detected
In this paper, we introduce a novel natural language generation task, termed text morphing, which targets generating intermediate sentences that are fluent and transition smoothly between the two input sentences.
We study open domain dialogue generation with dialogue acts designed to explain how people engage in social chat.
An intuitive way for a human to write paraphrase sentences is to replace words or phrases in the original sentence with their corresponding synonyms and make necessary changes to ensure the new sentences are fluent and grammatically correct.
Open domain response generation has achieved remarkable progress in recent years, but sometimes yields short and uninformative responses.
We focus on the one-shot learning for video-based person re-Identification (re-ID).
We propose a method that can leverage unlabeled data to learn a matching model for response selection in retrieval-based chatbots.
Conventional methods model open domain dialogue generation as a black box through end-to-end learning from large scale conversation data.
The task requires matching a response candidate with a conversation context, whose challenges include how to recognize important parts of the context, and how to model the relationships among utterances in the context.
This paper presents our system for SemEval-2017 Task 3, Community Question Answering (CQA).
Since the matrix formed by nonlocal similar patches in a natural image is of low rank, nuclear norm minimization (NNM) has been widely used in various image processing studies.
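For reference, the workhorse behind NNM is singular value thresholding, the proximal operator of the nuclear norm; the sketch below shows the standard computation, not code from the paper:

```python
import numpy as np

def svt(matrix: np.ndarray, tau: float) -> np.ndarray:
    """Singular value thresholding: soft-threshold all singular values by
    tau and reconstruct. This is the core step of NNM-based denoising of a
    matrix of stacked nonlocal similar patches."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s = np.maximum(s - tau, 0.0)       # shrink the singular values
    return (u * s) @ vt                # low-rank estimate of the input
```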
Nonlocal image representation or group sparsity has attracted considerable interest in various low-level vision tasks and has led to several state-of-the-art image denoising techniques, such as BM3D and LSSC.
Person re-identification (re-ID) and attribute recognition share a common target at learning pedestrian descriptions.
Ranked #75 on Person Re-Identification on DukeMTMC-reID
With word-level attention, hidden vectors of a word-level encoder are synthesized into utterance vectors and fed to an utterance-level encoder to construct hidden representations of the context.
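A skeletal PyTorch version of this hierarchy, with assumed shapes and layer choices, may make the flow easier to follow:

```python
import torch
import torch.nn as nn

class HierarchicalContextEncoder(nn.Module):
    """Word-level RNN states are attention-pooled into one vector per
    utterance; those utterance vectors feed an utterance-level RNN that
    yields hidden representations of the whole context."""
    def __init__(self, emb_dim: int, hid: int):
        super().__init__()
        self.word_enc = nn.GRU(emb_dim, hid, batch_first=True)
        self.attn = nn.Linear(hid, 1)
        self.utt_enc = nn.GRU(hid, hid, batch_first=True)

    def forward(self, utterances):                 # (batch, n_utt, n_words, emb)
        b, n, w, e = utterances.shape
        h, _ = self.word_enc(utterances.view(b * n, w, e))
        a = torch.softmax(self.attn(h), dim=1)     # word-level attention
        utt_vecs = (a * h).sum(dim=1).view(b, n, -1)
        ctx, _ = self.utt_enc(utt_vecs)            # context representations
        return ctx
```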
Existing work either concatenates utterances in the context or ultimately matches a response with a highly abstract context vector, which may lose relationships among utterances or important contextual information.
Ranked #7 on Conversational Response Selection on RRS
While automatic response generation for building chatbot systems has drawn a lot of attention recently, there is limited understanding of when we need to consider the linguistic context of an input text in the generation process.
We consider incorporating topic information into the sequence-to-sequence framework to generate informative and interesting responses for chatbots.
We propose a learning algorithm for fair classification that achieves both group fairness (the proportion of members in a protected group receiving positive classification is identical to the proportion in the population as a whole), and individual fairness (similar individuals should be treated similarly).
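For reference, the two criteria can be stated formally; this is a standard formulation (statistical parity and a Lipschitz condition), not necessarily the paper's exact definitions:

```latex
% Group fairness (statistical parity): equal positive rates across groups
\[
\Pr[\hat{Y} = 1 \mid A = a] = \Pr[\hat{Y} = 1] \quad \text{for every protected group } a.
\]
% Individual fairness (Lipschitz condition): similar individuals are treated similarly
\[
D\bigl(M(x), M(x')\bigr) \le L \, d(x, x') \quad \text{for all individuals } x, x'.
\]
```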