While deep neural networks have achieved remarkable successes, they suffer the well-known catastrophic forgetting issue when switching from existing tasks to tackle a new one.
When we do online paired data augmentation, we first generate augmented text through random token replacement, then pass the augmented text into the latent space alignment module to output the latent codes, which are finally fed to StyleGAN2 to generate the augmented images.
To this end, we discover the semantic meanings of StyleGAN latent space, such that we are able to produce face images of various expressions, poses, and lighting by controlling the latent codes.
To address the limitations, we propose "CodeRL", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL).
Anomaly detection in multivariate time series plays an important role in monitoring the behaviors of various real-world systems, e. g., IT system operations or manufacturing industry.
Specifically, we introduce a novel dialogue state tracking task to track the information of visual objects that are mentioned in video-grounded dialogues.
We demonstrate the efficacy of MUST on 8 downstream tasks across a variety of domains, where it improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
We introduce OmniXAI (short for Omni eXplainable AI), an open-source Python library of eXplainable AI (XAI), which offers omni-way explainable AI capabilities and various interpretable machine learning techniques to address the pain points of understanding and interpreting the decisions made by machine learning (ML) in practice.
Counterfactual explanation is an important Explainable AI technique to explain machine learning predictions.
Prompt Tuning (PT) has been largely successful as a parameter-efficient way of conditioning large-scale pre-trained language models towards a downstream task.
ICA forms the backbone of a simple-yet-effective Retrieval based RCA for new incidents, through an Information Retrieval system to search and rank past incidents and detect likely root causes from them, given the incident symptom.
The fast adaptation capability of deep neural networks in non-stationary environments is critical for online time series forecasting.
To achieve this, we first introduce an entity prompter module, which is trained with VTC to produce the similarity between a video crop and text prompts instantiated with entity names.
Ranked #2 on Visual Question Answering on MSRVTT-QA (using extra training data)
Ideally, how a node receives its neighborhood information should be a function of its local context, to diverge from the global GNN model shared by all nodes.
ICCL interpolates two images from a class-agnostic sampler and a class-aware sampler, and trains the model such that the representation of the interpolative image can be used to retrieve the centroids for both source classes.
The goal of natural language semantic code search is to retrieve a semantically relevant code snippet from a fixed set of candidates using a natural language query.
Our approach brings together several novel ideas in a systematic framework: (1) exploiting an unsupervised learning approach to obtain the sentence-level tree structure labels before training; (2) generating trees of target recipes from images with the supervision of tree structure labels learned from (1); and (3) integrating the learned tree structures into the recipe generation and food cross-modal retrieval procedure.
We present CodeT5, a unified pre-trained encoder-decoder Transformer model that better leverages the code semantics conveyed from the developer-assigned identifiers.
Ranked #1 on Text-to-Code Generation on CodeXGLUE - CONCODE
Video captioning targets interpreting the complex visual contents as text descriptions, which requires the model to fully understand video scenes including objects and their interactions.
In this paper, we propose a novel unified framework of Cycle-consistent Inverse GAN (CI-GAN) for both text-to-image generation and text-guided image manipulation tasks.
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context.
Existing food image segmentation models are underperforming due to two reasons: (1) there is a lack of high quality food image datasets with fine-grained ingredient labels and pixel-wise location masks -- the existing datasets either carry coarse ingredient labels or are small in size; and (2) the complex appearance of food makes it difficult to localize and recognize ingredients in food images, e. g., the ingredients may overlap one another in the same image, and the identical ingredient may appear distinctly in different food images.
Ranked #2 on Semantic Segmentation on FoodSeg103 (using extra training data)
Neural module networks (NMN) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images.
Neural Module Networks (NMNs) have been quite successful in incorporating explicit reasoning as learnable modules in various question answering tasks, including the most generic form of numerical reasoning over text in Machine Reading Comprehension (MRC).
3D face reconstruction plays a very important role in many real-world multimedia applications, including digital entertainment, social media, affection analysis, and person identification.
Sequential user behavior modeling plays a crucial role in online user-oriented services, such as product purchasing, news feed consumption, and online advertising.
Video-grounded dialogues are very challenging due to (i) the complexity of videos which contain both spatial and temporal variations, and (ii) the complexity of user utterances which query different segments and/or different objects in videos over multiple dialogue turns.
While participants in a multi-party multi-turn conversation simultaneously engage in multiple conversation topics, existing response selection methods are developed mainly focusing on a two-party single-conversation scenario.
Based on the learned EDU and entailment representations, we either reply to the user our final decision "yes/no/irrelevant" of the initial question, or generate a follow-up question to inquiry more information.
We propose momentum prototypes (MoPro), a simple contrastive learning method that achieves online label noise correction, out-of-distribution sample removal, and representation learning.
Ranked #12 on Image Classification on WebVision-1000
We investigate an open research task of generating cooking instructions based on only food images and ingredients, which is similar to the image captioning task.
Natural language interfaces to databases (NLIDB) democratize end user access to relational data.
Continual learning aims to learn continuously from a stream of tasks and data in an online-learning fashion, being capable of exploiting what was learned previously to improve current and future tasks while still being able to perform well on the previous tasks.
Recipe generation from food images and ingredients is a challenging task, which requires the interpretation of the information from another modality.
Automated disease classification of radiology images has been emerging as a promising technique to support clinical diagnosis and treatment planning.
Meta-learning methods have been extensively studied and applied in computer vision, especially for few-shot classification tasks.
The goal of conversational machine reading is to answer user questions given a knowledge base text which may require asking clarification questions.
Then we propose a theory-inspired path-regularized DARTS that consists of two key modules: (i) a differential group-structured sparse binary gate introduced for each operation to avoid unfair competition among operations, and (ii) a path-depth-wise regularization used to incite search exploration for deep architectures that often converge slower than shallow ones as shown in our theory and are not well explored during the search.
Pre-trained language models have shown remarkable success in improving various downstream NLP tasks due to their ability to capture dependencies in textual data and generate natural responses.
The goal of conversational machine reading is to answer user questions given a knowledge base text which may require asking clarification questions.
This paper presents Prototypical Contrastive Learning (PCL), an unsupervised representation learning method that addresses the fundamental limitations of instance-wise contrastive learning.
Ranked #24 on Semi-Supervised Image Classification on ImageNet - 1% labeled data (Top 5 Accuracy metric)
Building an end-to-end conversational agent for multi-domain task-oriented dialogues has been an open challenge for two main reasons.
By contrast, in this work, we propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages the pretrained BERT language models for Visual Dialog tasks.
We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos, unlike most existing methods which rely heavily on extensive annotated data.
Food retrieval is an important task to perform analysis of food-related information, where we are interested in retrieving relevant information about the queried food item such as ingredients, cooking instructions, etc.
Ranked #5 on Cross-Modal Retrieval on Recipe1M
Recent efforts in Dialogue State Tracking (DST) for task-oriented dialogues have progressed toward open-vocabulary or generation-based approaches where the models can generate slot value candidates from the dialogue history itself.
Ranked #13 on Multi-domain Dialogue State Tracking on MULTIWOZ 2.0
Incorporating hierarchical structures like constituency trees has been shown to be effective for various natural language processing (NLP) tasks.
Two prominent directions include learning with noisy labels and semi-supervised learning by exploiting unlabeled data.
Ranked #1 on Learning with noisy labels on CIFAR-100N
The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets.
Ranked #1 on Cross-Modal Person Re-Identification on RegDB-C
Similar to other information systems, a significant threat to industrial control systems is the attack from cyberspace---the offensive maneuvers launched by "anonymous" in the digital world that target computer-based assets with the goal of compromising a system's functions or probing for information.
To tackle this critical problem, we propose an attribute-aware pedestrian detector to explicitly model people's semantic attributes in a high-level feature detection fashion.
FoodAI has made food logging convenient, aiding smart consumption and a healthy lifestyle.
Specifically, Meta-RCNN learns an object detector in an episodic learning paradigm on the (meta) training data.
Object detection is a fundamental visual recognition problem in computer vision and has been widely studied in the past decades.
Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on visual and audio aspects of a given video, is significantly more challenging than traditional image or text-grounded dialogue systems because (1) feature space of videos span across multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.)
However, CF with binary codes naturally suffers from low accuracy due to limited representation capability in each bit, which impedes it from modeling complex structure of the data.
Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and healthy lifestyle.
Ranked #6 on Cross-Modal Retrieval on Recipe1M
Most existing work assumes that both training and test tasks are drawn from the same distribution, and a large amount of labeled data is available in the training tasks.
Image Super-Resolution (SR) is an important class of image processing techniques to enhance the resolution of images and videos in computer vision.
The proposed model is able to boost the performance of data clustering, semisupervised classification, and data recovery significantly, primarily due to two key factors: 1) enhanced low-rank recovery by exploiting the graph smoothness assumption, 2) improved graph construction by exploiting clean data recovered by robust PCA.
Cost-Sensitive Online Classification has drawn extensive attention in recent years, where the main approach is to directly online optimize two well-known cost-sensitive metrics: (i) weighted sum of sensitivity and specificity; (ii) weighted misclassification cost.
Recent years have witnessed many exciting achievements for object detection using deep learning techniques.
This approach allows the model to capture several types of semantic information, which was not possible by the existing models.
Online learning represents an important family of machine learning algorithms, in which a learner attempts to resolve an online prediction (or any type of decision-making) task by learning a model/hypothesis from a sequence of data instances one at a time.
In order to efficiently search in such a large 4-DoF space in real-time, we formulate the problem into two 2-DoF sub-problems and apply an efficient Block Coordinates Descent solver to optimize the estimation result.
In this paper, we propose a novel simple yet effective framework of "Feature Agglomeration Networks" (FANet) to build a new single stage face detector, which not only achieves state-of-the-art performance but also runs efficiently.
Deep Neural Networks (DNNs) are typically trained by backpropagation in a batch learning setting, which requires the entire training data to be made available prior to the learning task.
The conditional gradient algorithm has regained a surge of research interest in recent years due to its high efficiency in handling large-scale machine learning problems.
In this report, we present a new face detection scheme using deep learning and achieve the state-of-the-art detection performance on the well-known FDDB face detetion benchmark evaluation.
This article aims to provide a comprehensive survey and a structural understanding of Malicious URL Detection techniques using machine learning.
SOL is an open-source library for scalable online learning algorithms, and is particularly suitable for learning with high-dimensional data.
In this paper, we propose a novel scheme of Online Bayesian Collaborative Topic Regression (OBCTR) which is efficient and scalable for learning from data streams.
Unlike the existing approaches, in this paper, we propose a novel approach called Sparse Product Quantization (SPQ) to encoding the high-dimensional feature vectors into sparse representation.
Despite their encouraging results reported, the existing online AUC maximization algorithms often adopt simple online gradient descent approaches that fail to exploit the geometrical knowledge of the data observed during the online learning process, and thus could suffer from relatively larger regret.
To overcome this drawback, we present a novel framework of Budget Online Multiple Kernel Learning (BOMKL) and propose a new Sparse Passive Aggressive learning to perform effective budget online learning.
In this paper, we introduce "LOGO-Net", a large-scale logo image database for logo detection and brand recognition from real-world product images.
Unlike some existing online data stream classification techniques that are often based on first-order online learning, we propose a framework of Sparse Online Classification (SOC) for data stream classification, which includes some state-of-the-art first-order sparse online learning algorithms as special cases and allows us to derive a new effective second-order online learning algorithm for data stream classification.
Most modern trackers typically employ a bounding box given in the first frame to track visual objects, where their tracking results are often sensitive to the initialization.
However, unlike many second-order learning methods that often suffer from extra high computational cost, we devise a novel smart algorithm for second-order online feature selection using a MaxHeap-based approach, which is not only more effective than the existing first-order approaches, but also significantly more efficient and scalable for large-scale feature selection with ultra-high dimensional sparse data, as validated from our extensive experiments.
This article aims to provide a timely and comprehensive survey for both machine learning and data mining researchers in academia and quantitative portfolio managers in the financial industry to help them understand the state-of-the-art and facilitate their research and practical applications.
Machine learning techniques have been adopted to select portfolios from financial markets in some emerging intelligent business applications.