However, the crucial navigation clues (i.e., object-level environment layout) for the embodied navigation task are discarded, since the maintained vector is essentially unstructured.
3D visual grounding aims to locate the referred target object in 3D point cloud scenes according to a free-form language description.
RS takes previously detected results as references, aggregates the corresponding features from the combined features of adjacent frames, and makes a one-to-one track state prediction for each reference in parallel.
In this paper, we reveal and address the disadvantages of conventional query-driven HOI detectors from two aspects.
In this paper, we present a novel Distribution-Aware Single-stage (DAS) model for tackling the challenging multi-person 3D pose estimation problem.
In contrast, 2D grid-based methods such as PointPillars can easily achieve stable and efficient speed with simple 2D convolutions, but they struggle to reach competitive accuracy because of their coarse-grained point cloud representation.
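A minimal sketch of the 2D-grid idea behind such methods, assuming PointPillars-style shapes; every size and name below is an illustrative placeholder, not this paper's implementation:

```python
# Sketch of the pillar-to-BEV scatter step that lets grid-based detectors
# such as PointPillars run plain 2D convolutions (assumed shapes/values).
import torch

C, H, W = 64, 496, 432            # feature channels and BEV grid size (typical values)
num_pillars = 12000               # non-empty pillars in one sweep (illustrative)
pillar_feats = torch.randn(num_pillars, C)              # per-pillar features from a PointNet-style encoder
coords = torch.randint(0, min(H, W), (num_pillars, 2))  # (y, x) grid index of each pillar

canvas = torch.zeros(C, H * W)                          # empty bird's-eye-view canvas
canvas[:, coords[:, 0] * W + coords[:, 1]] = pillar_feats.t()
bev = canvas.view(C, H, W)        # dense 2D map, ready for ordinary Conv2d layers
```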
To this end, we propose a novel one-stage framework that disentangles human-object detection and interaction classification in a cascade manner.
Existing works usually adopt dynamic graph networks to indirectly model intra-/inter-modal interactions, making it difficult for the model to distinguish the referred object from distractors due to the monolithic representations of visual and linguistic contents.
Remote Embodied Referring Expression (REVERIE) is a recently proposed task that requires an agent to navigate to and localise a referred remote object according to a high-level language instruction.
In this paper, we tackle the weakly-supervised referring expression grounding task, i.e., the localization of a referent object in an image according to a query sentence, where the mapping between image regions and queries is not available during the training stage.
In this paper, we address the makeup transfer and removal tasks simultaneously, which aim to transfer the makeup from a reference image to a source image and to remove the makeup from a with-makeup image, respectively.
For the above exemplar case, our HRS task produces results in the form of relation triplets <girl [left hand], hold, book> and extracts segmentation masks of the book, with which the robot can easily accomplish the grabbing task.
In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) scheme to effectively mimic human behaviors and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models.
Though 3D convolutions are amenable to recognizing which actor is performing the queried actions, they also inevitably introduce misaligned spatial information from adjacent frames, which confuses features of the target frame and yields inaccurate segmentation.
To attain this, we map a trainable interaction query set to an interaction prediction set with a transformer.
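A minimal sketch of what such a query-to-prediction mapping can look like, using a standard transformer decoder in the DETR style; all sizes, names, and the 117-way verb head below are illustrative assumptions, not this paper's exact architecture:

```python
# Sketch: a trainable query set decoded into one prediction per query.
import torch
import torch.nn as nn

num_queries, d_model, hw = 64, 256, 400   # illustrative sizes; hw = flattened image tokens
queries = nn.Parameter(torch.randn(num_queries, 1, d_model))  # trainable interaction queries
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=8), num_layers=6)
memory = torch.randn(hw, 1, d_model)      # encoded image features (stand-in)

out = decoder(queries, memory)            # (num_queries, 1, d_model): one embedding per query
interaction_logits = nn.Linear(d_model, 117)(out)  # e.g. one score per verb class (117 in HICO-DET)
```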
To address the challenging task of instance-aware human part parsing, a new bottom-up regime is proposed to learn category-level human semantic segmentation as well as multi-person pose estimation in a joint and end-to-end manner.
In recent years, knowledge distillation has proven to be an effective solution for model compression.
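As a point of reference, a minimal sketch of the standard temperature-scaled distillation loss (Hinton et al.); the paper's own distillation scheme may differ:

```python
# Blend soft teacher targets with the ordinary hard-label loss.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                      # T^2 keeps the gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# usage: loss = kd_loss(student(x), teacher(x).detach(), y)
```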
Considering the complexity of doing visual relation detection in videos, we decompose this task into three sub-tasks: object detection, trajectory proposal and relation prediction.
Our ORDNet is able to extract more comprehensive context information and adapt well to complex spatial variance in scene images.
Given this cycle, we propose several free augmentation strategies to help our model understand various editing requests despite the imbalanced dataset.
Our CNMT consists of a reading module, a reasoning module, and a generation module, in which the reading module employs better OCR systems to enhance text-reading ability and a confidence embedding to select the most noteworthy tokens.
HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization.
In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels with the guidance of textual information.
Referring image segmentation aims to predict the foreground mask of the object referred to by a natural language sentence.
The LGR module utilizes body skeleton knowledge to construct a layout graph that connects all relevant part features, where a graph reasoning mechanism is used to propagate information among part nodes to mine their relations.
Temporal language grounding in untrimmed videos is a newly proposed task in video understanding.
Human and object points are the centers of the detection boxes, and the interaction point is the midpoint of the human and object points.
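The geometry described above can be written in a few lines, assuming boxes in [x1, y1, x2, y2] format:

```python
# Center of a box, and the interaction point as the midpoint of two centers.
def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def interaction_point(human_box, object_box):
    (hx, hy), (ox, oy) = center(human_box), center(object_box)
    return ((hx + ox) / 2.0, (hy + oy) / 2.0)

# interaction_point([0, 0, 2, 2], [4, 0, 6, 2]) -> (3.0, 1.0)
```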
In this paper, we propose an AdversarialNAS method specially tailored for Generative Adversarial Networks (GANs) to search for a superior generative model on the task of unconditional image generation.
Representation learning on a knowledge graph (KG) is to embed entities and relations of a KG into low-dimensional continuous vector spaces.
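For concreteness, a minimal sketch of one classic instantiation of this idea (TransE), where a triple (h, r, t) is scored by the distance between h + r and t; the sentence above does not name a specific model, so this is only an illustrative assumption:

```python
# TransE-style scoring: smaller distance => more plausible triple.
import torch
import torch.nn as nn

class TransE(nn.Module):
    def __init__(self, n_entities, n_relations, dim=100):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)

    def score(self, head, relation, tail):
        # a true triple (h, r, t) should satisfy h + r ≈ t
        return torch.norm(
            self.ent(head) + self.rel(relation) - self.ent(tail),
            p=1, dim=-1)
```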
First, it can exploit pixel alignment and feature alignment jointly.
Visual relationship recognition models are limited in the ability to generalize from finite seen predicates to unseen ones.
RCCF reformulates the referring expression comprehension as a correlation filtering process.
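A minimal sketch of the correlation-filtering idea: the expression is mapped to a kernel that is cross-correlated with the visual feature map, and the response peak marks the referred object's center (all names and sizes below are placeholders):

```python
# Language-conditioned correlation filtering (assumed shapes).
import torch
import torch.nn.functional as F

C, H, W, k = 256, 32, 32, 1
visual_feats = torch.randn(1, C, H, W)          # image feature map
lang_kernel = torch.randn(1, C, k, k)           # kernel predicted from the expression

response = F.conv2d(visual_feats, lang_kernel)  # correlation map (1, 1, H, W)
peak = response.flatten().argmax()
cy, cx = divmod(peak.item(), response.shape[-1])  # predicted target center
```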
In this paper, we address the makeup transfer task, which aims to transfer the makeup from a reference image to a source image.
To address this issue, we propose a method called Untraceable GAN, which has a novel source classifier that differentiates which domain an image is translated from and determines whether the translated image still retains characteristics of the source domain.
In this paper, we propose a design scheme for deep learning networks in the face parsing task with promising accuracy and real-time inference speed.
The age discriminative network guides the synthesized face to fit the real conditional distribution.
Our proposed model explicitly learns a feature compensation network, which is specialized for mitigating the cross-domain differences.
Finally, an automatic portrait animation system based on fast deep matting is built on mobile devices, which requires no user interaction and achieves real-time matting at 15 fps.
In this paper, we develop a Single frame Video Parsing (SVP) method that requires only one labeled frame per video in the training stage.
In this study, we present a weakly supervised approach that discovers the discriminative structures of sketch images, given pairs of sketch images and web images.
In this paper, we propose a novel Deep Localized Makeup Transfer Network to automatically recommend the most suitable makeup for a female and synthesize the makeup on her face.
We introduce a low-rank tensor constraint to explore the complementary information from multiple views and, accordingly, establish a novel method called Low-rank Tensor constrained Multiview Subspace Clustering (LT-MSC).
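A sketch of the typical form such an objective takes, with notation assumed rather than taken from the paper: per-view self-representations Z^{(v)} are stacked into a 3-mode tensor whose low-rank (nuclear-norm) constraint couples the views:

```latex
% \mathcal{Z} stacks Z^{(1)}, ..., Z^{(V)} into a 3-mode tensor;
% E concatenates the per-view error terms (illustrative notation).
\min_{\{Z^{(v)},\, E^{(v)}\}} \;
    \big\| \mathcal{Z} \big\|_{\ast} + \lambda \, \| E \|_{2,1}
\quad \text{s.t.} \quad
    X^{(v)} = X^{(v)} Z^{(v)} + E^{(v)}, \qquad v = 1, \dots, V
```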
In this work, we address the human parsing task with a novel Contextualized Convolutional Neural Network (Co-CNN) architecture, which well integrates the cross-layer context, global image-level context, within-super-pixel context and cross-super-pixel neighborhood context into a unified network.
Then the concept detector can be fine-tuned based on these new instances.
In this paper, we focus on how to boost multi-view clustering by exploring the complementary information among multi-view features.
Sparse representation has been applied to visual tracking by finding the best target candidate with minimal reconstruction error using target templates.
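In equation form (notation assumed): each candidate patch y_i is sparsely reconstructed from the template matrix T, and the candidate with the smallest residual is taken as the tracking result:

```latex
% Sparse coding of candidate i, then selection by reconstruction error.
\hat{c}_i = \arg\min_{c} \; \| y_i - T c \|_2^2 + \lambda \| c \|_1 ,
\qquad
i^{\ast} = \arg\min_{i} \; \| y_i - T \hat{c}_i \|_2^2
```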
Under the classic K Nearest Neighbor (KNN)-based nonparametric framework, the parametric Matching Convolutional Neural Network (M-CNN) is proposed to predict the matching confidence and displacements of the best-matched region in the testing image for a particular semantic region in one KNN image.
The first CNN uses max-pooling and is designed to predict the template coefficients for each label mask, while the second CNN omits max-pooling to preserve sensitivity to label mask position and accurately predict the active shape parameters.