In this paper, we have a critical insight that improving the feed-forward network (FFN) in BERT has a higher gain than improving the multi-head attention (MHA) since the computational cost of FFN is 2$\sim$3 times larger than MHA.
In this paper, we aim to advance the research of multi-modal pre-training on E-commerce and subsequently contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs, covering more than 6, 000 categories and 5, 000 attributes.
We present Voxel Transformer (VoTr), a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds.
Ranked #1 on 3D Object Detection on waymo vehicle (L1 mAP metric)
To resolve the problems, we propose a novel second-stage module, named pyramid RoI head, to adaptively learn the features from the sparse points of interest.
Ranked #1 on 3D Object Detection on waymo vehicle (AP metric)
We attribute this ranking correlation problem to the supernet training consistency shift, including feature shift and parameter shift.
Medical imaging technologies, including computed tomography (CT) or chest X-Ray (CXR), are largely employed to facilitate the diagnosis of the COVID-19.
Virtual 3D try-on can provide an intuitive and realistic view for online shopping and has a huge potential commercial value.
Fine-tuning from pre-trained ImageNet models has been a simple, effective, and popular approach for various computer vision tasks.
Despite recent progress on image-based virtual try-on, current methods are constraint by shared warping networks and thus fail to synthesize natural try-on results when faced with clothing categories that require different warping operations.
In this paper, we investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval among fine-grained product categories.
Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator to move to the wrong target by destroying the most instructive information in instructions at different timesteps.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Previous math word problem solvers following the encoder-decoder paradigm fail to explicitly incorporate essential math symbolic constraints, leading to unexplainable and unreasonable predictions.
Neural text generation models are typically trained by maximizing log-likelihood with the sequence cross entropy loss, which encourages an exact token-by-token match between a target sequence with a generated sequence.
To facilitate future research on exploiting unlabeled data for 3D detection, we additionally provide a benchmark in which we reproduce and evaluate a variety of self-supervised and semi-supervised methods on the ONCE dataset.
Aiming at facilitating a real-world, ever-evolving and scalable autonomous driving system, we present a large-scale benchmark for standardizing the evaluation of different self-supervised and semi-supervised approaches by learning from raw data, which is the first and largest benchmark to date.
However, since for a query, its negatives are uniformly sampled from all graphs, existing methods suffer from the critical sampling bias issue, i. e., the negatives likely having the same semantic structure with the query, leading to performance degradation.
However, these works do not explicitly reduce the data bias across different house scenes.
Ranked #16 on Vision and Language Navigation on VLN Challenge
To address these limitations, we propose Quantifiable Dialogue Coherence Evaluation (QuantiDCE), a novel framework aiming to train a quantifiable dialogue coherence metric that can reflect the actual human rating standards.
Therefore, we propose a Geometric Question Answering dataset GeoQA, containing 5, 010 geometric problems with corresponding annotated programs, which illustrate the solving process of the given problems.
While existing NAS methods mostly design architectures on a single task, algorithms that look beyond single-task search are surging to pursue a more efficient and universal solution across various tasks.
We further propose a novel geometry solving approach with formal language and symbolic reasoning, called Interpretable Geometry Problem Solver (Inter-GPS).
Ranked #1 on Mathematical Question Answering on Geometry3K
In this task, an agent is required to navigate from an arbitrary position in a 3D embodied environment to localize a target following a scene description.
The model encodes discourse information as a graph with elementary discourse units (EDUs) and discourse relations, and learns the discourse-aware features via a graph network for downstream QA tasks.
Ranked #9 on Reading Comprehension on ReClor
Here, we explore a dynamic network slimming regime, named Dynamic Slimmable Network (DS-Net), which aims to achieve good hardware-efficiency via dynamically adjusting filter numbers of networks at test time with respect to different inputs, while keeping filters stored statically and contiguously in hardware to prevent the extra burden.
In this work, we present Block-wisely Self-supervised Neural Architecture Search (BossNAS), an unsupervised NAS method that addresses the problem of inaccurate architecture rating caused by large weight-sharing space and biased supervision in previous methods.
1 code implementation • • Zhengzhong Liu, Guanxiong Ding, Avinash Bukkittu, Mansi Gupta, Pengzhi Gao, Atif Ahmed, Shikun Zhang, Xin Gao, Swapnil Singhavi, Linwei Li, Wei Wei, Zecong Hu, Haoran Shi, Haoying Zhang, Xiaodan Liang, Teruko Mitamura, Eric P. Xing, Zhiting Hu
Empirical natural language processing (NLP) systems in application domains (e. g., healthcare, finance, education) involve interoperation among multiple components, ranging from data ingestion, human annotation, to text retrieval, analysis, generation, and visualization.
A surprising result is that diagonal elements in the attention map are the least important compared with other attention positions.
For object detection, the well-established classification and regression loss functions have been carefully designed by considering diverse learning challenges.
Non-parallel text style transfer has attracted increasing research interests in recent years.
Prior highly-tuned image parsing models are usually studied in a certain domain with a specific set of semantic labels and can hardly be adapted into other scenarios (e. g., sharing discrepant label granularity) without extensive re-training.
Recent advances in multi-agent reinforcement learning have been largely limited in training one model from scratch for every new task.
Beyond generating long and topic-coherent paragraphs in traditional captioning tasks, the medical image report composition task poses more task-oriented challenges by requiring both the highly-accurate medical term diagnosis and multiple heterogeneous forms of information including impression and findings.
To solve this issue, we propose a novel DynamIc Self-sUperviSed Erasure (DISUSE) which adaptively erases redundant and artifactual clues in the context and questions to learn and establish the correct corresponding pair relations between the questions and their clues.
The curiosity is added to the target entropy to increase the entropy temperature for unfamiliar states and decrease the target entropy for familiar states.
The resulting model zoo is more training efficient than SOTA NAS models, e. g. 6x faster than RegNetY-16GF, and 1. 7x faster than EfficientNetB3.
While existing NAS methods mostly design architectures on one single task, algorithms that look beyond single-task search are surging to pursue a more efficient and universal solution across various tasks.
Recent advances in multi-agent reinforcement learning have been largely limited in training one model from scratch for every new task.
It is crucial since the quality of the evidence is the key to answering commonsense questions, and even determines the upper bound on the QA systems performance.
Besides, we develop a Graph-Evolving Meta-Learning (GEML) framework that learns to evolve the commonsense graph for reasoning disease-symptom correlations in a new disease, which effectively alleviates the needs of a large number of dialogues.
When sampling tasks in MML-ASR, AMS adaptively determines the task sampling probability for each source language.
Specifically, we generate the question-answer pair based on both the Visual Genome scene graph and an external knowledge base with controlled programs to disentangle the knowledge from other biases.
Panoptic segmentation that unifies instance segmentation and semantic segmentation has recently attracted increasing attention.
Ranked #12 on Panoptic Segmentation on COCO test-dev
Although deep reinforcement learning (RL) has been successfully applied to a variety of robotic control tasks, it's still challenging to apply it to real-world tasks, due to the poor sample efficiency.
It is time to step back and think about the robustness of partially supervised methods and how to maximally utilize small-scale and partially labeled data for medical image segmentation tasks.
In this work, we propose an efficient, cooperative and highly automated framework to simultaneously search for all main components including backbone, segmentation branches, and feature fusion module in a unified panoptic segmentation pipeline based on the prevailing one-shot Network Architecture Search (NAS) paradigm.
In this paper, we develop a general framework for interpretable natural language understanding that requires only a small set of human annotated explanations for training.
How to discriminatively vectorize graphs is a fundamental challenge that attracts increasing attentions in recent years.
To push forward the future research on building expert-sensitive medical dialogue system, we proposes two kinds of medical dialogue tasks based on MedDG dataset.
A practical automatic textual math word problems (MWPs) solver should be able to solve various textual MWPs while most existing works only focused on one-unknown linear MWPs.
Capitalized on the topic-level dialogue graph, we propose a new evaluation metric GRADE, which stands for Graph-enhanced Representations for Automatic Dialogue Evaluation.
In this paper, we propose a novel lane-sensitive architecture search framework named CurveLane-NAS to automatically capture both long-ranged coherent and accurate short-range curve information while unifying both architecture search and post-processing on curve lane predictions via point blending.
Ranked #12 on Lane Detection on CULane
In spite of its remarkable progress, many algorithms are restricted to particular search spaces.
Ranked #3 on Neural Architecture Search on NAS-Bench-201, ImageNet-16-120 (Accuracy (val) metric)
Firstly, the regions of primary interest to radiologists are usually located in a small area of the global image, meaning that the remainder parts of the image could be considered as irrelevant noise in the training procedure.
We introduce a Bidirectional Graph Reasoning Network (BGRNet), which incorporates graph structure into the conventional panoptic segmentation network to mine the intra-modular and intermodular relations within and between foreground things and background stuff classes.
Inspired by the property of a capsule network that can carve a tree structure inside a regular convolutional neural network (CNN), we propose a hierarchical compositional reasoning model called the "Linguistically driven Graph Capsule Network", where the compositional process is guided by the linguistic parse tree.
Benefiting from the collaborative learning of the L-mem and the V-mem, our CMN is able to explore the memory about the decision making of historical navigation actions which is for the current step.
To address this problem, this paper presents a propensity-based patient simulator (PBPS), which is capable of facilitating the training of MAD agents by generating informative counterfactual answers along with the disease diagnosis.
Is a hand-crafted detection network tailored for natural image undoubtedly good enough over a discrepant medical lesion domain?
Finally, an InterDomain Transfer Module is proposed to exploit diverse transfer dependencies across all domains and enhance the regional feature representation by attending and transferring semantic contexts globally.
Target-guided open-domain conversation aims to proactively and naturally guide a dialogue agent or human to achieve specific goals, topics or keywords during open-ended conversations.
Moreover, we find that the knowledge of a network model lies not only in the network parameters but also in the network architecture.
In this paper, we present a two-stage coarse-to-fine searching strategy named Structural-to-Modular NAS (SM-NAS) for searching a GPU-friendly design of both an efficient combination of modules and better modular-level architecture for object detection.
In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks to take advantage of the additional training signals derived from the semantic information.
Ranked #7 on Vision and Language Navigation on VLN Challenge
Our HGL consists of a primal vision-to-answer heterogeneous graph (VAHG) module and a dual question-to-answer heterogeneous graph (QAHG) module to interactively refine reasoning paths for semantic agreement.
Each Layout-Graph Reasoning(LGR) layer aims to map feature representations into structural graph nodes via a Map-to-Node module, performs reasoning over structural graph nodes to achieve global layout coherency via a layout-graph reasoning module, and then maps graph nodes back to enhance feature representations via a Node-to-Map module.
Resembling the rapid learning capability of human, few-shot learning empowers vision systems to understand new concepts by training with few samples.
Ranked #8 on Few-Shot Object Detection on MS-COCO (30-shot)
Explanation and high-order reasoning capabilities are crucial for real-world visual question answering with diverse levels of inference complexity (e. g., what is the dog that is near the girl playing with?)
(Unsupervised) Domain Adaptation (DA) seeks for classifying target instances when solely provided with source labeled and target unlabeled examples for training.
Ranked #2 on Multi-target Domain Adaptation on Office-Home
A broad range of cross-$m$-domain generation researches boil down to matching a joint distribution by deep generative models (DGMs).
Interactive fashion image manipulation, which enables users to edit images with sketches and color strokes, is an interesting research problem with great application value.
Learning semantic configurations and activation of modules to align well with structured knowledge can be regarded as a decision-making procedure, which is solved by a new graph-based reinforcement learning algorithm.
Graph neural networks (GNN) have gained increasing research interests as a mean to the challenging goal of robust and universal graph learning.
By distilling universal semantic graph representation to each specific task, Graphonomy is able to predict all levels of parsing labels in one system without piling up the complexity.
Generating long and semantic-coherent reports to describe medical images poses great challenges towards bridging visual and linguistic modalities, incorporating medical domain knowledge, and generating realistic and accurate descriptions.
Given an input person image, a desired clothes image, and a desired pose, the proposed Multi-pose Guided Virtual Try-on Network (MG-VTON) can generate a new person image after fitting the desired clothes into the input image and manipulating human poses.
Ranked #1 on Virtual Try-on on Deep-Fashion
Besides the challenges for conversational dialogue systems (e. g. topic transition coherency and question understanding), automatic medical diagnosis further poses more critical requirements for the dialogue rationality in the context of medical knowledge and symptom-disease relations.
That is, the model learns to imitate the writing style of any given exemplar sentence, with automatic adaptions to faithfully describe the content record.
To cooperate with local convolutions, each SGR is constituted by three modules: a) a primal local-to-semantic voting module where the features of all symbolic nodes are generated by voting from local representations; b) a graph reasoning module propagates information over knowledge graph to achieve global semantic coherency; c) a dual semantic-to-local mapping module learns new associations of the evolved symbolic nodes with local representations, and accordingly enhances local features.
Ranked #38 on Semantic Segmentation on ADE20K val
Despite remarkable advances in image synthesis research, existing works often fail in manipulating images under the context of large geometric transformations.
Many machine learning problems involve iteratively and alternately optimizing different task objectives with respect to different sets of parameters.
Collaborative reasoning for understanding image-question pairs is a very critical but underexplored topic in interpretable visual question answering systems.
3 code implementations • • Zhiting Hu, Haoran Shi, Bowen Tan, Wentao Wang, Zichao Yang, Tiancheng Zhao, Junxian He, Lianhui Qin, Di Wang, Xuezhe Ma, Zhengzhong Liu, Xiaodan Liang, Wangrong Zhu, Devendra Singh Sachan, Eric P. Xing
The versatile toolkit also fosters technique sharing across different text generation tasks.
In this paper, we address this problem by training relational context-aware agents which learn the actions to localize the target person from the gallery of whole scene images.
Despite the promising results on paired/unpaired image-to-image translation achieved by Generative Adversarial Networks (GANs), prior works often only transfer the low-level information (e. g. color or texture changes), but fail to manipulate high-level semantic meanings (e. g., geometric structure or content) of different object regions.
We explore an approach to forecasting human motion in a few milliseconds given an input 3D skeleton sequence based on a recurrent encoder-decoder framework.
To our best knowledge, this is the first work to model the multi-view clustering in a deep joint framework, which will provide a meaningful thinking in unsupervised multi-view learning.
Beyond the existing single-person and multiple-person human parsing tasks in static images, this paper makes the first attempt to investigate a more realistic video instance-level human parsing that simultaneously segments out each person instance and parses each instance into more fine-grained parts (e. g., head, leg, dress).
Instance-level human parsing towards real-world human analysis scenarios is still under-explored due to the absence of sufficient data resources and technical difficulty in parsing multiple instances in a single pass.
Ranked #3 on Human Part Segmentation on CIHP
Motivated by the zoom-in operation of a pathologist using a digital microscope, RAZN learns a policy network to decide whether zooming is required in a given region of interest.
Second, to alleviate boundary artifacts of warped clothes and make the results more realistic, we employ a Try-On Module that learns a composition mask to integrate the warped clothes and the rendered image to ensure smoothness.
Video summarization plays an important role in video understanding by selecting key frames/shots.
To our knowledge, this is the first successful case of the learned driving policy through reinforcement learning in the high-fidelity simulator, which performs better-than supervised imitation learning.
Specifically, we propose a model that enforces our intuition that prediction masks should be domain independent.
Analogously, this paper introduces geometric generalization based zero-shot learning tests to measure the rapid learning ability and the internal consistency of deep generative models.
no code implementations • • Zhiting Hu, Zichao Yang, Tiancheng Zhao, Haoran Shi, Junxian He, Di Wang, Xuezhe Ma, Zhengzhong Liu, Xiaodan Liang, Lianhui Qin, Devendra Singh Chaplot, Bowen Tan, Xingjiang Yu, Eric Xing
The features make Texar particularly suitable for technique sharing and generalization across different text generation applications.
In this paper, we formulate this problem as a Markov Decision Process, where agents are learned to segment object regions under a deep reinforcement learning framework.
Graph convolutional neural networks have recently shown great potential for the task of zero-shot learning.
Experiments show that our approach achieves the state-of-the-art results on two medical report datasets, generating well-balanced structured sentences with robust coverage of heterogeneous medical report contents.
Cellular Electron CryoTomography (CECT) is a 3D imaging technique that captures information about the structure and spatial organization of macromolecular complexes within single cells, in near-native state and at sub-molecular resolution.
Specifically, DTR-GAN learns a dilated temporal relational generator and a discriminator with three-player loss in an adversarial manner.
We argue that semantic salient segmentation can instead be effectively resolved by reformulating it as a simple yet intuitive pixel-pair based connectivity prediction task.
To further explore and take advantage of the semantic correlation of these two tasks, we propose a novel joint human parsing and pose estimation network to explore efficient context modeling, which can simultaneously predict parsing and pose with extremely high quality.
Ranked #7 on Semantic Segmentation on LIP val
This network comprises of two collaborative modules: i) an adversarial attention module to exploit the local visual evidence for each word parsed from the question; ii) a residual composition module to compose the previously mined evidence.
Moreoever, we demonstrate a universal segmentation model that is jointly trained on diverse datasets can surpass the performance of the common fine-tuning scheme for exploiting multiple domain knowledge.
Ranked #41 on Semantic Segmentation on ADE20K val
Cellular Electron Cryo-Tomography (CECT) is a powerful imaging technique for the 3D visualization of cellular structure and organization at submolecular resolution.
In the spectrum of vision-based autonomous driving, vanilla end-to-end models are not interpretable and suboptimal in performance, while mediated perception models require additional intermediate representations such as segmentation masks or detection bounding boxes, whose annotation can be prohibitively expensive as we move to a larger scale.
It presents two main contributions to traditional GANs: 1) a soft gradient-sensitive objective for keeping semantic boundaries; 2) a semantic-aware discriminator for validating the fidelity of personalized adaptions with respect to each semantic region.
Unsupervised video summarization plays an important role on digesting, browsing, and searching the ever-growing videos every day, and the underlying fine-grained semantic and motion information (i. e., objects of interest and their key motions) in online videos has been barely touched.
We study the problem of conditional generative modeling based on designated semantics or structures.
An intuition on human segmentation is that when a human is moving in a video, the video-context (e. g., appearance and motion clues) may potentially infer reasonable mask information for the whole human body.
Face hallucination is a domain-specific super-resolution problem with the goal to generate high-resolution (HR) faces from low-resolution (LR) input images.
A common issue, however, is that objects of interest that are not involved in human actions are often absent in global action descriptions known as "missing label".
Ranked #3 on Weakly Supervised Object Detection on Charades
Generative Adversarial Networks (GANs) have recently achieved significant improvement on paired/unpaired image-to-image translation, such as photo$\rightarrow$ sketch and artist painting style transfer.
Ranked #3 on Facial Expression Translation on CelebA
To make both synthesized future frames and flows indistinguishable from reality, a dual adversarial training method is proposed to ensure that the future-flow prediction is able to help infer realistic future-frames, while the future-frame prediction in turn leads to realistic optical flows.
3D human articulated pose recovery from monocular image sequences is very challenging due to the diverse appearances, viewpoints, occlusions, and also the human 3D pose is inherently ambiguous from the monocular imagery.
In this work, we address the small object detection problem by developing a single architecture that internally lifts representations of small objects to "super-resolved" ones, achieving similar characteristics as large objects and thus more discriminative for detection.
We show that Poseidon enables Caffe and TensorFlow to achieve 15. 5x speed-up on 16 single-GPU machines, even with limited bandwidth (10GbE) and the challenging VGG19-22K network for image classification.
Through this adversarial process the critic network learns the higher order structures and guides the segmentation model to achieve realistic segmentation outcomes.
We investigate a principle way to progressively mine discriminative object regions using classification networks to address the weakly-supervised semantic segmentation problems.
In this work, we propose hierarchical nonparametric variational autoencoders, which combines tree-structured Bayesian nonparametric priors with VAEs, to enable infinite flexibility of the latent representation space.
The proposed Recurrent Topic-Transition Generative Adversarial Network (RTT-GAN) builds an adversarial framework between a structured paragraph generator and multi-level paragraph discriminators.
We cast this problem as manipulating an input image according to a parametric model whose key parameters can be conditionally generated from any guiding signal (even unseen ones).
Human parsing has recently attracted a lot of research interests due to its huge application potentials.
Ranked #10 on Semantic Segmentation on LIP val
Therefore, Tree-RL can better cover different objects with various scales which is quite appealing in the context of object proposal.
To capture such global interdependency, we propose a deep Variation-structured Reinforcement Learning (VRL) framework to sequentially discover object relationships and attributes in the whole image.
Instead of learning LSTM models over the pre-fixed structures, we propose to further learn the intermediate interpretable multi-level graph structures in a progressive and stochastic way from data during the LSTM network optimization.
Generic generation and manipulation of text is challenging and has limited success compared to recent deep generative modeling in visual domain.
Most of existing detection pipelines treat object proposals independently and predict bounding box locations and classification scores over them separately.
In this paper, we propose a novel inference-embedded multi-task learning framework for predicting human pose from still depth images, which is implemented with a deep architecture of neural networks.
Objective functions for training of deep networks for face-related recognition tasks, such as facial expression recognition (FER), usually consider each sample independently.
Ranked #2 on Facial Expression Recognition on Oulu-CASIA
Detecting pedestrian has been arguably addressed as a special topic beyond general object detection.
Ranked #14 on Pedestrian Detection on Caltech
Another long short-term memorized fusion layer is set up to integrate the contexts along the vertical direction from different channels, and perform bi-directional propagation of the fused vertical contexts along the horizontal direction to obtain true 2D global contexts.
This paper addresses a fundamental problem of scene understanding: How to parse the scene image into a structured configuration (i. e., a semantic object hierarchy with object interaction relations) that finely accords with human perception.
This paper addresses the problem of geometric scene parsing, i. e. simultaneously labeling geometric surfaces (e. g. sky, ground and vertical plane) and determining the interaction relations (e. g. layering, supporting, siding and affinity) between main regions.
We provide preliminary answers to these questions through developing a novel Attention to Context Convolution Neural Network (AC-CNN) based object detection model.
By taking the semantic object parsing task as an exemplar application scenario, we propose the Graph Long Short-Term Memory (Graph LSTM) network, which is the generalization of LSTM from sequential data or multi-dimensional data to general graph-structured data.
In particular, in order to improve the localization accuracy, a fully convolutional network is employed which predicts locations of object proposals for each pixel.
Then the concept detector can be fine-tuned based on these new instances.
In this work, we address the human parsing task with a novel Contextualized Convolutional Neural Network (Co-CNN) architecture, which well integrates the cross-layer context, global image-level context, within-super-pixel context and cross-super-pixel neighborhood context into a unified network.
The long chains of sequential computation by stacked LG-LSTM layers also enable each pixel to sense a much larger region for inference benefiting from the memorization of previous dependencies in all positions along all dimensions.
By being reversible, the proposal refinement sub-network adaptively determines an optimal number of refinement iterations required for each proposal during both training and testing.
Taking pedestrian detection as an example, we illustrate how we can leverage this philosophy to develop a Scale-Aware Fast R-CNN (SAF R-CNN) framework.
Ranked #18 on Pedestrian Detection on Caltech
Then, a better network called Enhanced-DCNN is learned with supervision from the predicted segmentation masks of simple images based on the Initial-DCNN as well as the image-level annotations.
Instance-level object segmentation is an important yet under-explored task.
Under the classic K Nearest Neighbor (KNN)-based nonparametric framework, the parametric Matching Convolutional Neural Network (M-CNN) is proposed to predict the matching confidence and displacements of the best matched region in the testing image for a particular semantic region in one KNN image.
The first CNN network is with max-pooling, and designed to predict the template coefficients for each label mask, while the second CNN network is without max-pooling to preserve sensitivity to label mask position and accurately predict the active shape parameters.
The aim of this study is to provide an automatic computational framework to assist clinicians in diagnosing Focal Liver Lesions (FLLs) in Contrast-Enhancement Ultrasound (CEUS).
Although it has been widely discussed in video surveillance, background subtraction is still an open problem in the context of complex scenarios, e. g., dynamic backgrounds, illumination variations, and indistinct foreground objects.