Open-domain question answering is a challenging task with a wide variety of practical applications.
This paper proposes a novel 3D sparse convolution-based Deep Dynamic Point Cloud Compression (D-DPCC) network to compress dynamic point cloud (DPC) geometry with 3D motion estimation and motion compensation in the feature space.
With the popularity of multi-modal sensors, visible-thermal (RGB-T) object tracking aims to achieve robust performance and broader applicability by exploiting objects' temperature information.
Based on the MDP similarity, we design a hybrid transfer method (consisting of instance transfer and strategy transfer) to efficiently transfer the samples and the knowledge from one entrance to another.
We conduct extensive offline experiments on the effectiveness of these auxiliary tasks and test our method on a real-world food delivery platform.
A mixed list of ads and organic items is usually displayed in a feed, and allocating the limited slots to maximize overall revenue is a key problem.
Modern smart sensor-based energy management systems leverage non-intrusive load monitoring (NILM) to predict and optimize appliance load distribution in real-time.
First, we present a Transformer tracking (named TransT) method based on the Siamese-like feature extraction backbone, the designed attention-based fusion mechanism, and the classification and regression head.
NTM-DMIE is a neural network method for topic learning which maximizes the mutual information between the input documents and their latent topic representation.
One of the most critical challenges in deep reinforcement learning is to maintain the long-term exploration capability of the agent.
Very often, the pose is estimated after face detection.
SpeechFlow is a powerful factorization model based on information bottleneck (IB), and its effectiveness has been reported by several studies.
Overparameterized transformer-based architectures have shown remarkable performance in recent years, achieving state-of-the-art results in speech processing tasks such as speech recognition, speech synthesis, keyword spotting, and speech enhancement.
Validation of these models, however, has been a challenge because the ground truth is unknown: only one treatment-outcome pair for each person can be observed.
In this paper, we propose a novel time-aware sequential recommendation framework called Social Temporal Excitation Networks (STEN), which introduces temporal point processes to model the fine-grained impact of friends' behaviors on the user's dynamic interests in an event-level direct paradigm.
Standard training methods on such biased data tend to repeatedly generate recommendations of head items, which further exacerbates the data bias and hinders the exploration of potentially interesting niche (tail) items.
In this paper, we propose a novel model named DemiNet (short for DEpendency-Aware Multi-Interest Network) to address the above two issues.
Our model results in higher revenue and better user experience than state-of-the-art baselines in offline experiments.
To leverage the strength of text generation for information retrieval, in this article, we propose a novel approach which effectively integrates text generation models into PRF-based query expansion.
Ranking models have achieved promising results, but it remains challenging to design personalized ranking systems to leverage user profiles and semantic representations between queries and documents.
Deep learning based visual trackers entail offline pre-training on large volumes of video data with accurate bounding box annotations that are labor-intensive to obtain.
In addition to the Language Identification (LID) tasks, multilingual Automatic Speech Recognition (ASR) tasks are introduced to the OLR 2021 Challenge for the first time.
Furthermore, we leverage a variational auto-encoder (VAE) model to capture the life-long novelty of states, which is combined with the global JFI score to form multimodal intrinsic rewards.
The fifth Oriental Language Recognition (OLR) Challenge focuses on language recognition in a variety of complex environments to promote its development.
Though recent works have developed methods that can generate estimates (or imputations) of the missing entries in a dataset to facilitate downstream analysis, most depend on assumptions that may not align with real-world applications and could suffer from poor performance in subsequent tasks such as classification.
Successful applications of InfoNCE and its variants have popularized the use of contrastive variational mutual information (MI) estimators in machine learning.
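As a minimal illustration of such a contrastive MI estimator, the following NumPy sketch computes the InfoNCE lower bound from a matrix of critic scores; the `infonce_bound` helper and the toy score matrix are illustrative assumptions, not code from any of the papers listed here.

```python
import numpy as np

def infonce_bound(scores):
    """InfoNCE lower bound on mutual information.

    scores[i, j] is a learned critic value f(x_i, y_j); diagonal entries
    score positive (jointly drawn) pairs, off-diagonals score negatives.
    The bound is log N plus the mean log-softmax of the positives.
    """
    n = scores.shape[0]
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return np.log(n) + np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
# toy critic: positives (diagonal) score higher than negatives
scores = rng.normal(size=(8, 8)) + 5.0 * np.eye(8)
mi_estimate = infonce_bound(scores)
```

A well-known property of this family of estimators is visible in the code: with a batch of N candidates, the estimate can never exceed log N, regardless of the true mutual information.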
Recent works aiming to improve the robustness of pre-trained models mainly focus on adversarial training with perturbed examples of similar semantics, neglecting the utilization of different or even opposite semantics.
To address the above challenges, we develop a deep learning based Analogy-aware Offensive Meme Detection (AOMD) framework to learn the implicit analogy from the multi-modal contents of the meme and effectively detect offensive analogy memes.
Automatic speaker verification (ASV) is a well-developed technology for biometric identification and has been ubiquitously deployed in security-critical applications such as banking and access control.
To address these challenges, we introduce a dataset, Detecting Underwater Objects (DUO), and a corresponding benchmark based on the collection and re-annotation of all relevant datasets.
In this paper, we propose a streaming end-to-end framework that can process multiple intentions in an online and incremental way.
Object tracking has achieved significant progress over the past few years.
In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component.
Ranked #3 on Visual Object Tracking on TrackingNet
The correlation operation is a simple fusion manner to consider the similarity between the template and the search region.
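This correlation operation can be sketched as a plain sliding-window inner product between template and search-region features; the `naive_correlation` helper and the toy tensors below are hypothetical, chosen only to illustrate the mechanism (real trackers implement it as a batched depthwise convolution).

```python
import numpy as np

def naive_correlation(template, search):
    """Slide a template feature map over a search-region feature map and
    return a response map of inner-product similarities.

    template: (C, th, tw); search: (C, H, W) with H >= th and W >= tw.
    """
    c, th, tw = template.shape
    _, H, W = search.shape
    out = np.empty((H - th + 1, W - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = search[:, i:i + th, j:j + tw]
            out[i, j] = np.sum(window * template)
    return out

rng = np.random.default_rng(0)
search = rng.normal(size=(4, 16, 16))
template = search[:, 5:9, 6:10].copy()  # plant the template in the search region
response = naive_correlation(template, search)
peak = np.unravel_index(response.argmax(), response.shape)
```

Planting the template inside the search features makes the response map peak at the planted location, which is exactly how Siamese trackers localize the target.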
Ranked #6 on Visual Object Tracking on LaSOT
To solve the above problem, in this work, we propose a novel method called Adversarial and Contrastive Variational Autoencoder (ACVAE) for sequential recommendation.
The bottleneck of solving this problem is the accurate perception of rapid dynamic objects.
There has been growing interest in representation learning for text data, based on theoretical arguments and empirical evidence.
Video-based person re-identification aims to associate the video clips of the same person across multiple non-overlapping cameras.
Ranked #1 on Person Re-Identification on DukeMTMC-VideoReID
The primary goal of knowledge distillation (KD) is to encapsulate the information of a model learned from a teacher network into a student network, with the latter being more compact than the former.
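A minimal sketch of the classic distillation objective, assuming the standard formulation: a temperature-softened teacher/student KL term blended with hard-label cross-entropy. The temperature, blend weight, and toy logits below are illustrative choices, not values from the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with the KL divergence between
    temperature-softened teacher and student distributions (scaled by
    T^2, as is conventional, to keep gradient magnitudes comparable)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kd = np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)) * T * T
    ce = -np.mean(np.log(softmax(student_logits)[np.arange(len(labels)), labels]))
    return alpha * ce + (1 - alpha) * kd

teacher_logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]])
student_logits = teacher_logits + np.array([[0.4, -0.3, 0.0], [0.2, 0.0, -0.5]])
labels = np.array([0, 1])
loss = distillation_loss(student_logits, teacher_logits, labels)
```

When the student matches the teacher exactly, the KL term vanishes, so minimizing this loss pulls the compact student toward the teacher's soft predictions.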
The main reason is that the large number of nodes (i.e., video frames) makes it hard for GCNs to capture and model temporal relations in videos.
Ranked #13 on Action Segmentation on Breakfast
Many recent trackers adopt the multiple-stage tracking strategy to improve the quality of bounding box estimation.
Deep neural networks excel at comprehending complex visual signals, delivering performance on par with or even superior to that of human experts.
This paper proposes a decoder-side Cross Resolution Synthesis (CRS) module to pursue better compression efficiency beyond the latest Versatile Video Coding (VVC), where we encode intra frames at original high resolution (HR), compress inter frames at a lower resolution (LR), and then super-resolve decoded LR inter frames with the help from preceding HR intra and neighboring LR inter frames.
Dialogue Act Recognition (DAR) is a challenging problem in Natural Language Understanding, which aims to attach Dialogue Act (DA) labels to each utterance in a conversation.
Local dependencies, which capture short-term emotional effects between neighbouring utterances, are further injected via an Aggregation Graph to distinguish the subtle differences between utterances containing emotional phrases.
Ranked #11 on Emotion Recognition in Conversation on IEMOCAP
Recently, speech enhancement (SE) based on deep speech prior has attracted much attention, such as the variational auto-encoder with non-negative matrix factorization (VAE-NMF) architecture.
In this paper, we argue that this problem is largely attributed to the maximum-likelihood (ML) training criterion of the DNF model, which aims to maximize the likelihood of the observations but not necessarily improve the Gaussianity of the latent codes.
Domain mismatch often occurs in real applications and causes serious performance degradation in speaker verification systems.
Various information factors are blended in speech signals, which forms the primary difficulty for most speech information processing tasks.
Furthermore, WANA proposes a set of test oracles to detect the vulnerabilities in EOSIO and Ethereum smart contracts based on WebAssembly bytecode analysis.
In this study, we propose a novel RGB-T tracking framework by jointly modeling both appearance and motion cues.
Active contour models have been widely used in image segmentation, and the level set method (LSM) is the most popular approach for solving the models, via implicitly representing the contour by a level set function.
iii) How to efficiently guide the cars to the event locations with little prior knowledge of the road damage caused by the disaster, while also handling the dynamics of the physical world and social media?
Based on Kaldi and PyTorch, recipes for i-vector and x-vector systems are also provided as baselines for the three tasks.
Domain generalization remains a critical problem for speaker recognition, even with the state-of-the-art architectures based on deep neural nets.
Human conversations contain many types of information, e.g., knowledge, common sense, and language habits.
Generative adversarial networks (GANs), famous for their capability of learning complex underlying data distributions, are nevertheless known to be tricky to train, which often results in mode collapse or performance deterioration.
In this vision paper, we discuss the roles of CovidSens and identify potential challenges in developing reliable social sensing based risk alert systems.
Most top-ranked long-term trackers adopt offline-trained Siamese architectures; thus, they cannot benefit from the great progress of short-term trackers with online updates.
An effective and efficient perturbation generator is trained with a carefully designed adversarial loss, which can simultaneously cool hot regions where the target exists on the heatmaps and force the predicted bounding box to shrink, making the tracked target invisible to trackers.
Then, we propose two types of inference models, opt-GRU and relation-GRU, which encode the object relationships and motion representations effectively and form a discriminative frame-level feature representation.
Ranked #2 on Group Activity Recognition on Volleyball
In this paper, we present a graph representation learning method atop transaction networks for merchant incentive optimization in mobile payment marketing.
Our model builds upon a variational encoder, which transforms the input video into a latent feature space, and a Luenberger-type observer, which captures the dynamic evolution of the latent features.
no code implementations • 20 Dec 2019 • Xi Liu, Rui Zhang, Yongsheng Zhou, Qianyi Jiang, Qi Song, Nan Li, Kai Zhou, Lei Wang, Dong Wang, Minghui Liao, Mingkun Yang, Xiang Bai, Baoguang Shi, Dimosthenis Karatzas, Shijian Lu, C. V. Jawahar
21 teams submitted results for Task 1, 23 for Task 2, 24 for Task 3, and 13 for Task 4.
Inference, estimation, sampling and likelihood evaluation are four primary goals of probabilistic modeling.
In this work, we design a simple computing library to bridge the gap and decouple the work of ocean modeling from parallel computing.
Attention-based models have shown significant improvement over traditional algorithms in several NLP tasks.
The ROI (region-of-interest) based pooling method performs pooling operations on the cropped ROI regions for various samples and has shown great success in the object detection methods.
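The mechanism can be sketched in a few lines: crop the region of interest from the feature map, then max-pool it onto a fixed-size grid so that variable-size ROIs yield fixed-size features. The `roi_max_pool` helper below is a naive, purely illustrative version (production detectors use batched GPU kernels such as torchvision's `roi_pool`).

```python
import numpy as np

def roi_max_pool(feature, roi, out_size=2):
    """Naive ROI max-pooling: crop a region from a feature map and
    max-pool it onto a fixed out_size x out_size grid.

    feature: (C, H, W); roi: (x0, y0, x1, y1) in feature-map coordinates.
    """
    x0, y0, x1, y1 = roi
    crop = feature[:, y0:y1, x0:x1]
    c, h, w = crop.shape
    out = np.empty((c, out_size, out_size))
    # split the crop into roughly equal bins along each axis
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            out[:, i, j] = crop[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(1, 2))
    return out

feature = np.arange(64, dtype=float).reshape(1, 8, 8)
pooled = roi_max_pool(feature, (0, 0, 4, 4))  # shape (1, 2, 2)
```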
These datasets tend to deliver over-optimistic performance and do not meet the needs of research on speaker recognition in unconstrained conditions.
Speech signals are complex composites of various information, including phonetic content, speaker traits, channel effect, etc.
no code implementations • • Dawei Du, Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Lin, QinGhua Hu, Tao Peng, Jiayu Zheng, Xinyao Wang, Yue Zhang, Liefeng Bo, Hailin Shi, Rui Zhu, Aashish Kumar, Aijin Li, Almaz Zinollayev, Anuar Askergaliyev, Arne Schumann, Binjie Mao, Byeongwon Lee, Chang Liu, Changrui Chen, Chunhong Pan, Chunlei Huo, Da Yu, Dechun Cong, Dening Zeng, Dheeraj Reddy Pailla, Di Li, Dong Wang, Donghyeon Cho, Dongyu Zhang, Furui Bai, George Jose, Guangyu Gao, Guizhong Liu, Haitao Xiong, Hao Qi, Haoran Wang, Heqian Qiu, Hongliang Li, Huchuan Lu, Ildoo Kim, Jaekyum Kim, Jane Shen, Jihoon Lee, Jing Ge, Jingjing Xu, Jingkai Zhou, Jonas Meier, Jun Won Choi, Junhao Hu, Junyi Zhang, Junying Huang, Kaiqi Huang, Keyang Wang, Lars Sommer, Lei Jin, Lei Zhang
Results of 33 object detection algorithms are presented.
In this work, we propose a novel gradient-guided network to exploit the discriminative information in gradients and update the template in the siamese network through feed-forward and backward operations.
Ranked #3 on Visual Object Tracking on OTB-2015 (Precision metric)
Social sensing has emerged as a new sensing paradigm where humans (or devices on their behalf) collectively report measurements about the physical world.
In this work, we present a novel robust and real-time long-term tracking framework based on the proposed skimming and perusal modules.
By enforcing the neural model to discriminate the speakers in the training set, deep speaker embedding (called `x-vectors`) can be derived from the hidden layers.
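The extraction step can be sketched as follows: frame-level layers process the input features, statistics pooling aggregates over time, and an embedding layer's activations form the fixed-dimensional speaker vector. The tiny random-weight network below is purely illustrative; a real x-vector extractor is a trained TDNN with a speaker-classification head on top during training.

```python
import numpy as np

def stats_pooling(frame_feats):
    """Aggregate frame-level features (T, D) into a fixed utterance-level
    vector (2D,) by concatenating the mean and std over time."""
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

class TinyEmbedder:
    """Minimal stand-in for a speaker-discriminative network: one hidden
    frame-level layer, stats pooling, then an embedding layer whose
    activations serve as the speaker vector (the classification head used
    during training is omitted)."""
    def __init__(self, in_dim=24, hid=32, emb=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(in_dim, hid))
        self.W2 = rng.normal(scale=0.1, size=(2 * hid, emb))
    def embed(self, frames):                   # frames: (T, in_dim)
        h = np.maximum(frames @ self.W1, 0.0)  # frame-level ReLU layer
        return stats_pooling(h) @ self.W2      # utterance-level embedding

net = TinyEmbedder()
utt = np.random.default_rng(1).normal(size=(200, 24))  # 200 frames of features
xvec = net.embed(utt)
```

The key property is that statistics pooling maps any number of frames to a fixed-dimensional utterance-level vector, so utterances of different lengths become directly comparable.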
Current clickbait detection solutions that mainly focus on analyzing the text of the title, the image of the thumbnail, or the content of the video are shown to be suboptimal in detecting the online clickbait videos.
This technical report attempts to provide an efficient and solid toolkit for the field of crowd counting, denoted as the Crowd Counting Code Framework (C$^3$F).
We propose a Leaked Motion Video Predictor (LMVP) to predict future frames by capturing the spatial and temporal dependencies from given inputs.
Our method accelerates the network in a one-step pruning-recovery manner with a novel optimization objective function, achieving higher accuracy at much lower cost than existing pruning methods.
Deep learning models have a large number of free parameters that need to be calculated by effective training of the models on a great deal of training data to improve their generalization performance.
Given conversational context with persona information, how a chatbot can exploit that information to generate diverse and sustained conversations is still a non-trivial task.
First-order methods such as stochastic gradient descent (SGD) are currently the standard algorithm for training deep neural networks.
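For reference, the plain SGD update is just a step against the gradient; the quadratic toy objective below is an illustrative choice, not drawn from the paper.

```python
import numpy as np

def sgd_step(params, grads, lr=0.1):
    """One plain SGD update: theta <- theta - lr * grad."""
    return [p - lr * g for p, g in zip(params, grads)]

# minimize f(w) = ||w||^2 / 2, whose gradient is simply w;
# each step shrinks w by a factor of (1 - lr)
w = np.array([4.0, -2.0])
for _ in range(50):
    (w,) = sgd_step([w], [w], lr=0.1)
```

On this objective, 50 steps with lr = 0.1 contract the parameters by 0.9^50, i.e., to well under 1% of their initial magnitude.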
Action prediction aims to determine what action is occurring in a video as early as possible, which is crucial to many online applications, such as predicting a traffic accident before it happens or detecting malicious actions in a monitoring system.
The classification objective ensures that each modal network predicts the true action category, while the competing objective encourages each modal network to outperform the other.
3) Results of motion orientation and magnitude are adaptively weighted and fused by a Bayesian model, which makes the proposed method more robust and able to handle more kinds of abnormal events.
Human actions captured in video sequences contain two crucial factors for action recognition, i.e., visual appearance and motion dynamics.
In this paper, we propose a novel iterative convolution-thresholding method (ICTM) that is applicable to a range of variational models for image segmentation.
Results on different encoding schemes indicate that using a machine model to accelerate optimization evaluation and reduce experimental cost is feasible to some extent, which could dramatically promote the upgrading of encoding schemes and thus help the blind improve their visual perception ability.
The Cognitive Ocean Network (CONet) will become the mainstream of future ocean science and engineering developments.
This paper proposes a Gaussian-constrained training approach that (1) discards the parametric classifier, and (2) enforces the distribution of the derived speaker vectors to be Gaussian.
This score reflects the similarity of the two frames in phonetic content, and is used to weigh the contribution of this frame pair in the utterance-based scoring.
Compared with short-term tracking, the long-term tracking task requires determining whether the tracked object is present or absent, and then estimating an accurate bounding box if present or conducting image-wide re-detection if absent.
Considering the intrinsic consistency between the informativeness of the regions and their probability of being the ground-truth class, we design a novel training paradigm that enables the Navigator to detect the most informative regions under the guidance of the Teacher.
Ranked #28 on Fine-Grained Image Classification on FGVC Aircraft
In this paper, we circumvent this issue by proposing a local structure learning method, which simultaneously considers the local patterns of the target and their structural relationships for more accurate target tracking.
The third oriental language recognition (OLR) challenge AP18-OLR is introduced in this paper, including the data profile, the tasks and the evaluation principles.
To address these issues, we propose a multi-stream bottom-top-bottom fully convolutional network (BTBNet), which is the first attempt to develop an end-to-end deep network for DBD.
Ranked #2 on Defocus Estimation on CUHK - Blur Detection Dataset (MAE metric)
It is ubiquitous that time series contain many missing values.
To address this issue, we propose a novel CF-based optimization problem to jointly model the discrimination and reliability information.
Various informative factors are mixed in speech signals, leading to great difficulty in decoding any one of them.
Extensive experiments demonstrate that the proposed algorithm achieves competitive performance in both saliency detection and visual tracking, especially outperforming other related trackers on the non-rigid object tracking datasets.
This paper proposes an Agile Aggregating Multi-Level feaTure framework (Agile Amulet) for salient object detection.
Trivial events are ubiquitous in human-to-human conversations, e.g., coughing, laughing, and sniffing.
In hospitals, physicians rely on massive clinical data to make diagnostic decisions, among which laboratory tests are one of the most important resources.
Recent studies have shown that speaker patterns can be learned from very short speech segments (e.g., 0.3 seconds) by a carefully designed convolutional & time-delay deep neural network (CT-DNN) model.
The intensive annotation cost and the rich but unlabeled data contained in videos motivate us to propose an unsupervised video-based person re-identification (re-ID) method.
Ranked #6 on Person Re-Identification on PRID2011
Neural machine translation (NMT) has achieved notable success in recent years; however, it is also widely recognized that this approach has limitations in handling infrequent words and word pairs.
In this paper, we propose a novel deep fully convolutional network model for accurate salient object detection.
Ranked #5 on Saliency Detection on DUT-OMRON
In addition, to achieve accurate boundary inference and semantic enhancement, edge-aware feature maps in low-level layers and the predicted results of low resolution features are recursively embedded into the learning framework.
Ranked #18 on RGB Salient Object Detection on DUTS-TE
In the second stage, FIN is fine-tuned with its predicted saliency maps as ground truth.
This principle has recently been applied in several prototype studies on speaker verification (SV), where the feature learning and the classifier are trained together with an objective function consistent with the evaluation metric.
Second, we propose a fully convolutional neural network with spatially regularized kernels, through which the filter kernel corresponding to each output channel is forced to focus on a specific region of the target.
Ranked #11 on Visual Object Tracking on VOT2017/18
The experiments demonstrated that the feature-based system outperformed the i-vector system by a large margin, particularly under language mismatch between enrollment and test.
In this paper, we demonstrate that the speaker factor is also a short-time spectral pattern and can be largely identified from just a few frames using a simple deep neural network (DNN).
It has been shown that Chinese poems can be successfully generated by sequence-to-sequence neural models, particularly with the attention mechanism.
Recently deep neural networks (DNNs) have been used to learn speaker features.
Deep neural models, particularly the LSTM-RNN model, have shown great potential for language identification (LID).
Pure acoustic neural models, particularly the LSTM-RNN model, have shown great potential in language identification (LID).
Recurrent neural networks (RNNs) have shown clear superiority in sequence modeling, particularly the ones with gated units, such as long short-term memory (LSTM) and gated recurrent unit (GRU).
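A minimal GRU cell makes the gating explicit; the dimensions and random weights below are illustrative, not tied to any paper in this list.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: an update gate z and a reset gate r control how
    much of the previous hidden state is kept versus overwritten."""
    def __init__(self, in_dim, hid, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        self.Wz = rng.normal(scale=s, size=(in_dim + hid, hid))
        self.Wr = rng.normal(scale=s, size=(in_dim + hid, hid))
        self.Wh = rng.normal(scale=s, size=(in_dim + hid, hid))
    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ self.Wz)                               # update gate
        r = sigmoid(xh @ self.Wr)                               # reset gate
        h_tilde = np.tanh(np.concatenate([x, r * h]) @ self.Wh)  # candidate
        return (1 - z) * h + z * h_tilde                        # gated mix

cell = GRUCell(in_dim=8, hid=16)
h = np.zeros(16)
for x in np.random.default_rng(1).normal(size=(20, 8)):
    h = cell.step(x, h)
```

Because each step interpolates between the previous state and a tanh-bounded candidate, the hidden state stays bounded, which is part of why gated units remain stable over long sequences.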
This paper presents a unified model to perform language and speaker recognition simultaneously and altogether.
We present the AP16-OL7 database which was released as the training and test data for the oriental language recognition (OLR) challenge on APSIPA 2016.
PLDA is a popular normalization approach for the i-vector model, and it has delivered state-of-the-art performance in speaker verification.
We present the OC16-CE80 Chinese-English mixlingual speech database which was released as a main resource for training, development and test for the Chinese-English mixlingual speech recognition (MixASR-CHEN) challenge on O-COCOSDA 2016.
Although highly correlated, speech and speaker recognition have been regarded as two independent tasks and studied by two communities.
This paper presents a combination approach to the SUSR tasks with two phonetic-aware systems: one is the DNN-based i-vector system and the other is our recently proposed subregion-based GMM-UBM system.
Given a base tracker, an ensemble of trackers is generated in which each tracker's update behavior is paced; each tracker then traces the target object forward and backward to generate a pair of trajectories over an interval.
In this paper, we present a novel approach to automatic 3D Facial Expression Recognition (FER) based on deep representation of facial 3D geometric and 2D photometric attributes.
The popular i-vector model represents speakers as low-dimensional continuous vectors (i-vectors), and hence it is a way of continuous speaker embedding.
Probabilistic linear discriminant analysis (PLDA) is a popular normalization approach for the i-vector model, and has delivered state-of-the-art performance in speaker recognition.
This assumption does not hold for a wide range of data types in practical applications, for instance spherical data for which the local proximity is better modelled by the von Mises-Fisher (vMF) distribution instead of the Gaussian.
A deep learning approach has recently been proposed to derive speaker identities (d-vectors) with a deep neural network (DNN).
Compared to conventional layer-wise methods, this new method is agnostic to the model structure, so it can be used to pre-train very complex models.
Recent research shows that deep neural networks (DNNs) can be used to extract deep speaker vectors (d-vectors) that preserve speaker characteristics and can be used in speaker verification.
Recent research has found that a well-trained model can be used as a teacher to train other child models, by using the predictions generated by the teacher model as supervision.