Explicit object-level semantic correspondence between audio and visual modalities is established by gathering object information from visual features with predefined audio queries.
Domain shifts, such as sensor type changes and geographical variations, are prevalent in Autonomous Driving (AD) and pose a challenge: an AD model that relies on previous-domain knowledge can hardly be deployed directly to a new domain without additional cost.
Specifically, we analyze the performance changes of different methods under different bandwidths, providing a deep insight into the performance-bandwidth trade-off issue.
CCPD transfers the fundamental, point-to-point wayfinding skill, which is well trained on the large-scale PointGoal task, to ORAN, helping ORAN master audio-visual navigation with far fewer training samples.
To address these limitations, we present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation.
no code implementations • 29 Jun 2023 • Le Zhuo, Ruibin Yuan, Jiahao Pan, Yinghao Ma, Yizhi Li, Ge Zhang, Si Liu, Roger Dannenberg, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wenhu Chen, Wei Xue, Yike Guo
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal.
no code implementations • 18 Jun 2023 • Ruibin Yuan, Yinghao Ma, Yizhi Li, Ge Zhang, Xingran Chen, Hanzhi Yin, Le Zhuo, Yiqi Liu, Jiawen Huang, Zeyue Tian, Binyue Deng, Ningzhi Wang, Chenghua Lin, Emmanouil Benetos, Anton Ragni, Norbert Gyenge, Roger Dannenberg, Wenhu Chen, Gus Xia, Wei Xue, Si Liu, Shi Wang, Ruibo Liu, Yike Guo, Jie Fu
This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark.
In this paper, we study existing approaches and identify a dominant factor in defining tight approximation, namely the approximation domain of the activation function.
Besides, the generated pseudo-labels can fluctuate and be inaccurate at the early stage of training.
Second, through our design, the object queries and the foreground query in the decoder share consensus on the class semantics, therefore making the strong and weak supervision mutually benefit each other for domain alignment.
With the prevalence of multimodal learning, camera-LiDAR fusion has gained popularity in 3D object detection.
Via abstraction, all perturbed images are mapped into intervals before feeding into neural networks for training.
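As a rough illustration of this interval abstraction, the sketch below maps each pixel into a perturbation interval before training; the perturbation radius, clipping range, and center/radius encoding are our assumptions, not necessarily the paper's setup.

```python
import torch

def abstract_to_interval(images: torch.Tensor, eps: float):
    """Map each pixel x to the interval [x - eps, x + eps], clipped to [0, 1].

    Generic interval abstraction; the paper's exact perturbation model
    may differ (eps and the clipping range are illustrative assumptions).
    """
    lower = (images - eps).clamp(0.0, 1.0)
    upper = (images + eps).clamp(0.0, 1.0)
    # One common encoding feeds the interval's center and radius to the network.
    return (lower + upper) / 2, (upper - lower) / 2

# Example: a batch of 4 RGB 32x32 images perturbed within eps = 2/255.
center, radius = abstract_to_interval(torch.rand(4, 3, 32, 32), eps=2 / 255)
```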
When extracting object knowledge from PVLMs, the former adaptively transforms object proposals and adopts object-aware mask attention to obtain precise and complete knowledge of objects.
Ranked #7 on Open Vocabulary Object Detection on MSCOCO
The first method is One-to-many Matching via Data Augmentation (denoted as DataAug-DETR).
An attempt has been made to get rid of BEV and predict 3D lanes directly from FV representations, but it still underperforms other BEV-based methods given its lack of a structured representation for 3D lanes.
Ranked #3 on 3D Lane Detection on Apollo Synthetic 3D Lane
For the generated queries, we design a sparse cross attention module to force them to focus on the features of specific objects, which suppresses interference from noise.
In this paper, we propose an Adaptive Zone-aware Hierarchical Planner (AZHP) that explicitly divides the navigation process into two heterogeneous phases, i.e., sub-goal setting via zone partition/selection (high-level action) and sub-goal execution (low-level action), for hierarchical planning.
To alleviate these limitations, we propose a novel Template-Bridged Search region Interaction (TBSI) module which exploits templates as the medium to bridge the cross-modal interaction between RGB and TIR search regions by gathering and distributing target-relevant object and environment contexts.
Our MAC aims to reduce the spatial and temporal redundancy of video representations in the VidLP model via a mask sampling mechanism to improve pre-training efficiency.
Ranked #31 on Video Retrieval on MSR-VTT-1kA (using extra training data)
Through simulating point cloud data in different LiDAR placements, we can evaluate the perception accuracy of these placements using multiple detection models.
In this paper, we present a novel training scheme, namely Teach-DETR, to learn better DETR-based detectors from versatile teacher detectors.
In this paper, we propose a novel, tight and scalable reachability analysis approach for DRL systems.
We believe our dataset, benchmark model, and evaluation metric will boost the development of video background music generation.
We observe that existing approaches only rely on overestimated domains, while the corresponding tight approximation may not necessarily be tight on its actual domain.
To better bridge the domain gap between source domain (synthetic data) and target domain (real-world data), we also propose a Selective Feature Alignment (SFA) module which only aligns the features of consistent foreground area between the two domains, thus realizing inter-domain intra-modality adaptation.
Existing methods for human mesh recovery mainly focus on single-view frameworks, but they often fail to produce accurate results due to the ill-posed setup.
2 code implementations • 12 Sep 2022 • Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, Hao Tian, Enze Xie, Jiangwei Xie, Li Chen, Tianyu Li, Yang Li, Yulu Gao, Xiaosong Jia, Si Liu, Jianping Shi, Dahua Lin, Yu Qiao
As sensor configurations grow more complex, integrating multi-source information from different sensors and representing features in a unified view become vitally important.
Considerable efforts have been devoted to finding the so-called tighter approximations to obtain more precise verification results.
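To make the role of the approximation domain concrete, here is a minimal sketch of linear bounds for sigmoid on an interval [l, u]; the secant/tangent construction is a common textbook choice and the fallback for mixed-sign domains is deliberately loose, so this is illustrative rather than the paper's method.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_bounds_sigmoid(l, u):
    """Linear lower/upper bounds (k, b) of sigmoid on the domain [l, u].

    Secant and midpoint-tangent lines where they are sound, constant
    bounds when [l, u] straddles zero; a textbook construction, not
    necessarily the paper's.
    """
    k_sec = (sigmoid(u) - sigmoid(l)) / (u - l)
    b_sec = sigmoid(l) - k_sec * l
    m = (l + u) / 2.0
    k_tan = sigmoid(m) * (1.0 - sigmoid(m))  # sigmoid'(m)
    b_tan = sigmoid(m) - k_tan * m
    if l >= 0:   # sigmoid is concave here: secant below, tangent above
        return (k_sec, b_sec), (k_tan, b_tan)
    if u <= 0:   # sigmoid is convex here: tangent below, secant above
        return (k_tan, b_tan), (k_sec, b_sec)
    # Mixed-sign domain: fall back to sound constant bounds.
    return (0.0, sigmoid(l)), (0.0, sigmoid(u))

lower, upper = linear_bounds_sigmoid(0.5, 2.0)
```

Shrinking [l, u] toward the activation's actual input range tightens both bounds, which is the intuition behind analyzing the approximation domain rather than an overestimate of it.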
Human pose estimation aims to accurately estimate a wide variety of human poses.
To alleviate these drawbacks, we propose a one-stage end-to-end Pixel-Phrase Matching Network (PPMN), which directly matches each phrase to its corresponding pixels instead of region proposals and outputs panoptic segmentation by simple combination.
Vision-language navigation is the task of directing an embodied agent to navigate in 3D scenes with natural language instructions.
We observe that the core difficulty for heterogeneous KD (hetero-KD) is the significant semantic gap between the backbone features of heterogeneous detectors due to the different optimization manners.
Referring video object segmentation aims to predict foreground labels for objects referred by natural language expressions in videos.
Ranked #5 on Referring Video Object Segmentation on MeViS
However, the crucial navigation clues (i.e., the object-level environment layout) for the embodied navigation task are discarded, since the maintained vector is essentially unstructured.
3D visual grounding aims to locate the referred target object in 3D point cloud scenes according to a free-form language description.
RS takes previously detected results as references to aggregate the corresponding features from the combined features of the adjacent frames and makes a one-to-one track state prediction for each reference in parallel.
In this paper, we reveal and address the disadvantages of conventional query-driven HOI detectors from two aspects.
Ranked #11 on Human-Object Interaction Detection on HICO-DET
In this paper, we present a novel Distribution-Aware Single-stage (DAS) model for tackling the challenging multi-person 3D pose estimation problem.
In contrast, 2D grid-based methods such as PointPillar can easily achieve stable and efficient speed with simple 2D convolutions, but it is hard for them to attain competitive accuracy due to the coarse-grained point cloud representation.
To this end, we propose a novel one-stage framework that disentangles human-object detection and interaction classification in a cascade manner.
Ranked #7 on Human-Object Interaction Detection on V-COCO
Existing works usually adopt dynamic graph networks to indirectly model the intra/inter-modal interactions, making the model difficult to distinguish the referred object from distractors due to the monolithic representations of visual and linguistic contents.
The Remote Embodied Referring Expression (REVERIE) task is a recently proposed task that requires an agent to navigate to and localise a referred remote object according to a high-level language instruction.
In this paper, we tackle the weakly-supervised referring expression grounding task, which localizes a referent object in an image according to a query sentence, where the mapping between image regions and queries is not available during the training stage.
In this paper, we address the makeup transfer and removal tasks simultaneously, which aim to transfer the makeup from a reference image to a source image and remove the makeup from the with-makeup image respectively.
For the above exemplar case, our HRS task produces results in the form of relation triplets <girl [left hand], hold, book> and extracts segmentation masks of the book, with which the robot can easily accomplish the grabbing task.
In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) scheme to effectively mimic human behaviors and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models.
Ranked #7 on Referring Expression Segmentation on J-HMDB
Though 3D convolutions are amenable to recognizing which actor is performing the queried actions, they also inevitably introduce misaligned spatial information from adjacent frames, which confuses features of the target frame and yields inaccurate segmentation.
Ranked #8 on Referring Expression Segmentation on J-HMDB
To attain this, we map a trainable interaction query set to an interaction prediction set with a transformer.
Ranked #27 on Human-Object Interaction Detection on HICO-DET (using extra training data)
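A minimal sketch of mapping a trainable query set to a prediction set with a transformer decoder follows (PyTorch); the query count, feature width, and prediction heads are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class InteractionDecoder(nn.Module):
    """Maps a trainable interaction query set to an interaction prediction set.

    Dimensions and head design are illustrative, not the paper's exact model.
    """
    def __init__(self, num_queries=100, d_model=256, num_classes=80, num_verbs=117):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)  # trainable query set
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.human_box = nn.Linear(d_model, 4)    # (cx, cy, w, h), normalized
        self.object_box = nn.Linear(d_model, 4)
        self.object_cls = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.verb_cls = nn.Linear(d_model, num_verbs)

    def forward(self, memory):
        # memory: encoded image features, shape (batch, hw, d_model)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        h = self.decoder(q, memory)
        return {
            "human_boxes": self.human_box(h).sigmoid(),
            "object_boxes": self.object_box(h).sigmoid(),
            "object_logits": self.object_cls(h),
            "verb_logits": self.verb_cls(h),
        }

# Example: features from a 25x38 feature map with 256 channels.
model = InteractionDecoder()
out = model(torch.rand(2, 25 * 38, 256))
```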
To address the challenging task of instance-aware human part parsing, a new bottom-up regime is proposed to learn category-level human semantic segmentation as well as multi-person pose estimation in a joint and end-to-end manner.
In recent years, knowledge distillation has proven to be an effective solution for model compression.
Considering the complexity of doing visual relation detection in videos, we decompose this task into three sub-tasks: object detection, trajectory proposal and relation prediction.
Our ORDNet is able to extract more comprehensive context information and well adapt to complex spatial variance in scene images.
Given the cycle, we propose several free augmentation strategies to help our model understand various editing requests given the imbalanced dataset.
Our CNMT consists of reading, reasoning, and generation modules, in which the Reading Module employs better OCR systems to enhance text-reading ability and a confidence embedding to select the most noteworthy tokens.
HC-STVG is a video grounding task that requires both spatial (where) and temporal (when) localization.
Referring image segmentation aims to predict the foreground mask of the object referred by a natural language sentence.
In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels with the guidance of textual information.
Ranked #11 on Referring Expression Segmentation on RefCOCO testB
The LGR module utilizes body skeleton knowledge to construct a layout graph that connects all relevant part features, where a graph reasoning mechanism is used to propagate information among part nodes to mine their relations.
Temporal language grounding in untrimmed videos is a newly proposed task in video understanding.
The human and object points are the centers of the detection boxes, and the interaction point is the midpoint of the human and object points.
Ranked #24 on Human-Object Interaction Detection on V-COCO
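The geometric definition above is directly computable; a minimal sketch follows, where the (x1, y1, x2, y2) box format is our assumption.

```python
import numpy as np

def interaction_point(human_box, object_box):
    """Interaction point as the midpoint of the human and object box centers.

    Boxes are assumed to be (x1, y1, x2, y2); the format is our assumption.
    """
    human_box = np.asarray(human_box, dtype=float)
    object_box = np.asarray(object_box, dtype=float)
    human_center = (human_box[:2] + human_box[2:]) / 2
    object_center = (object_box[:2] + object_box[2:]) / 2
    return (human_center + object_center) / 2

# Example: a person at (10, 20)-(50, 120) and an object at (60, 40)-(90, 80).
print(interaction_point((10, 20, 50, 120), (60, 40, 90, 80)))  # -> [52.5 65. ]
```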
In this paper, we propose an AdversarialNAS method specially tailored for Generative Adversarial Networks (GANs) to search for a superior generative model on the task of unconditional image generation.
Representation learning on a knowledge graph (KG) aims to embed the entities and relations of a KG into low-dimensional continuous vector spaces.
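As one classic instance of such embedding, a minimal TransE-style sketch is shown below; TransE is illustrative here and not necessarily the model used in the paper.

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """TransE-style KG embedding: score(h, r, t) = -||h + r - t||.

    A classic example of embedding entities and relations into a
    low-dimensional space; illustrative, not the paper's model.
    """
    def __init__(self, num_entities, num_relations, dim=100):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)

    def score(self, heads, relations, tails):
        h, r, t = self.ent(heads), self.rel(relations), self.ent(tails)
        return -torch.norm(h + r - t, p=2, dim=-1)  # higher = more plausible

model = TransE(num_entities=10_000, num_relations=200)
triples = torch.tensor([[0, 3, 42], [7, 1, 99]])  # (head, relation, tail) ids
print(model.score(triples[:, 0], triples[:, 1], triples[:, 2]))
```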
First, it can exploit pixel alignment and feature alignment jointly.
Visual relationship recognition models are limited in the ability to generalize from finite seen predicates to unseen ones.
In this paper, we address the makeup transfer task, which aims to transfer the makeup from a reference image to a source image.
RCCF reformulates the referring expression comprehension as a correlation filtering process.
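A minimal sketch of the correlation-filtering idea follows: language features generate a kernel that is correlated with the visual feature map, and the response peak is read out as the target center. The layer sizes and the 1x1 kernel shape are our assumptions, not RCCF's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageCorrelationFilter(nn.Module):
    """Correlate a language-generated kernel with a visual feature map.

    A minimal sketch of correlation filtering for grounding; dimensions
    are illustrative assumptions.
    """
    def __init__(self, lang_dim=300, feat_dim=256, ksize=1):
        super().__init__()
        self.to_kernel = nn.Linear(lang_dim, feat_dim * ksize * ksize)
        self.feat_dim, self.ksize = feat_dim, ksize

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (1, C, H, W); lang_feat: (lang_dim,)
        kernel = self.to_kernel(lang_feat).view(1, self.feat_dim, self.ksize, self.ksize)
        heatmap = F.conv2d(vis_feat, kernel)  # (1, 1, H, W) correlation response
        idx = int(heatmap.flatten().argmax())
        h, w = heatmap.shape[-2:]
        return heatmap, divmod(idx, w)        # (row, col) of the response peak

model = LanguageCorrelationFilter()
heatmap, center = model(torch.rand(1, 256, 32, 32), torch.rand(300))
```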
To address this issue, we propose a method called Untraceable GAN, which has a novel source classifier to differentiate which domain an image is translated from, and determines whether the translated image still retains the characteristics of the source domain.
In this paper, we propose a design scheme for deep learning networks in the face parsing task with promising accuracy and real-time inference speed.
Ranked #6 on Face Parsing on CelebAMask-HQ
The age discriminative network guides the synthesized face to fit the real conditional distribution.
Our proposed model explicitly learns a feature compensation network, which is specialized for mitigating the cross-domain differences.
Finally, an automatic portrait animation system based on fast deep matting is built on mobile devices; it requires no interaction and achieves real-time matting at 15 fps.
In this paper, we develop a Single frame Video Parsing (SVP) method which requires only one labeled frame per video in the training stage.
In this study, we present a weakly supervised approach that discovers the discriminative structures of sketch images, given pairs of sketch images and web images.
In this paper, we propose a novel Deep Localized Makeup Transfer Network to automatically recommend the most suitable makeup for a female and synthesize the makeup on her face.
We introduce a low-rank tensor constraint to explore the complementary information from multiple views and, accordingly, establish a novel method called Low-rank Tensor constrained Multiview Subspace Clustering (LT-MSC).
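A common convex surrogate for such a low-rank tensor constraint is the sum of nuclear norms of the tensor's mode unfoldings; the sketch below uses this standard surrogate as a rough stand-in, and the per-view representation tensor is illustrative rather than LT-MSC's exact formulation.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding of a multi-way tensor into a matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def tensor_nuclear_norm(tensor):
    """Sum of nuclear norms of all mode unfoldings.

    A standard convex surrogate for tensor rank, used here as a rough
    stand-in for the low-rank tensor constraint described above.
    """
    return sum(
        np.linalg.norm(unfold(tensor, mode), ord="nuc")
        for mode in range(tensor.ndim)
    )

# Example: stack per-view self-representation matrices into a 3-way tensor.
num_views, n = 3, 50
Z = np.random.rand(num_views, n, n)  # one n x n representation per view
print(tensor_nuclear_norm(Z))
```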
Then the concept detector can be fine-tuned based on these new instances.
In this work, we address the human parsing task with a novel Contextualized Convolutional Neural Network (Co-CNN) architecture, which well integrates the cross-layer context, global image-level context, within-super-pixel context and cross-super-pixel neighborhood context into a unified network.
In this paper, we focus on how to boost the multi-view clustering by exploring the complementary information among multi-view features.
Sparse representation has been applied to visual tracking by finding the target candidate with the minimal reconstruction error from a set of target templates.
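A minimal sketch of this template-based selection follows, with an L1-regularized (Lasso) fit standing in for the sparse coding step; the solver choice and the omission of trivial templates are our simplifications.

```python
import numpy as np
from sklearn.linear_model import Lasso

def best_candidate(templates, candidates, alpha=0.01):
    """Pick the candidate with the smallest sparse-reconstruction error.

    templates:  (d, k) matrix, one vectorized target template per column.
    candidates: (d, m) matrix, one vectorized candidate patch per column.
    Lasso stands in for the sparse coding step; the paper's solver and
    trivial-template handling differ.
    """
    errors = []
    for i in range(candidates.shape[1]):
        y = candidates[:, i]
        coef = Lasso(alpha=alpha, positive=True, max_iter=5000).fit(templates, y).coef_
        errors.append(np.linalg.norm(y - templates @ coef))
    return int(np.argmin(errors)), errors

templates = np.random.rand(1024, 10)   # 32x32 grayscale templates, flattened
candidates = np.random.rand(1024, 30)  # candidate patches sampled around the target
idx, errs = best_candidate(templates, candidates)
```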
Under the classic K Nearest Neighbor (KNN)-based nonparametric framework, the parametric Matching Convolutional Neural Network (M-CNN) is proposed to predict, for a particular semantic region in one KNN image, the matching confidence and displacement of the best-matched region in the testing image.
The first CNN uses max-pooling and is designed to predict the template coefficients for each label mask, while the second CNN omits max-pooling to preserve sensitivity to label mask position and accurately predict the active shape parameters.