In autonomous driving, predicting future events and evaluating foreseeable risks empowers autonomous vehicles to plan their actions better, enhancing safety and efficiency on the road.
For the more challenging settings of relation-involved open vocabulary SGG, the proposed approach integrates relation-aware pre-training utilizing image-caption data and retains visual-concept alignment through knowledge distillation.
Specifically, our model contains two key components: the Commonsense-based Contrastive Learning and the Graph Relation Network.
1 code implementation • 1 Oct 2023 • Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Wenhu Chen, Jie Fu, Junran Peng
The advent of Large Language Models (LLMs) has paved the way for complex tasks such as role-playing, which enhances user interactions by enabling models to imitate various characters.
One-shot domain adaptation methods attempt to overcome these challenges by transferring the pre-trained source model to the target domain using only a single target sample.
As it is empirically observed that Vision Transformers (ViTs) are quite insensitive to the order of input tokens, the need for an appropriate self-supervised pretext task that enhances the location awareness of ViTs is becoming evident.
On top of the proposed TFA, we further introduce a test-time adaptation (TTA) mechanism to refine anomaly localization results, where a layer of trainable parameters in the adapter is optimized using TFA's pseudo-labels and synthetic noise-corrupted tokens.
Consequently, we develop a suite of components to complement the virtual voxel concept, including a virtual voxel encoder, a virtual voxel mixer, and a virtual voxel assignment strategy.
Class Incremental Semantic Segmentation (CISS) extends the traditional segmentation task by incrementally learning newly added classes.
Radar is ubiquitous in autonomous driving systems due to its low cost and robustness to adverse weather.
Considering this phenomenon, we propose Discriminability-Driven Graph Network (DDG-Net), which explicitly models ambiguous snippets and discriminative snippets with well-designed connections, preventing the transmission of ambiguous information and enhancing the discriminability of snippet-level representations.
However, there is a lack of a universal and fair benchmark for evaluating AD methods on medical images, which hinders the development of more generalized and robust AD methods in this specific domain.
The framework of visually-guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
In this work, we address this limitation by studying camera-based 3D panoptic segmentation, aiming to achieve a unified occupancy representation for camera-only 3D scene understanding.
Learning from completely reconstructed objects in global BA, GBA-Learner predicts pseudo labels for occluded objects.
In this paper, we rethink the data association in 2D MOT and utilize the 3D object representation to separate each object in the feature space.
A common practice is to select the highly confident predictions as the pseudo-ground-truths for each pixel, but this leaves most pixels unused because their predictions are unreliable.
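The confidence-based selection described above can be sketched in a few lines of numpy; the threshold value and function name here are illustrative, not taken from any of the listed papers.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Keep only high-confidence per-pixel predictions as pseudo-ground-truths.

    probs: (C, H, W) softmax scores. Returns an (H, W) label map with -1
    marking pixels whose confidence falls below the threshold (left unused).
    """
    conf = probs.max(axis=0)          # per-pixel confidence
    labels = probs.argmax(axis=0)     # per-pixel predicted class
    labels[conf < threshold] = -1     # discard unreliable pixels
    return labels

# Toy example: 3 classes on a 2x2 image
probs = np.array([[[0.95, 0.4], [0.2, 0.1]],
                  [[0.03, 0.3], [0.7, 0.2]],
                  [[0.02, 0.3], [0.1, 0.7]]])
print(select_pseudo_labels(probs, threshold=0.6))
```

With a 0.6 threshold, the ambiguous pixel (top confidence 0.4) is marked unused, which is exactly the waste the sentence above points out.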
These agents, equipped with the logic and common sense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments with text-based interactions.
To this end, we propose T2S-DA, which we interpret as a form of pulling Target to Source for Domain Adaptation, encouraging the model to learn similar cross-domain features.
Data and models are unquestionably the two supporting pillars of LiDAR object detection.
In this paper, we study how to effectively leverage image modality in the emerging fully sparse architecture.
Drawing inspiration from this, we propose a high-performance offline detector built from a track-centric perspective instead of the conventional object-centric one.
We explore long-term temporal visual correspondence-based optimization for 3D video object detection in this work.
For image matching, our method outperforms state-of-the-art methods with half the training data and iterations on ScanNet, a popular indoor dataset.
Constructing a hierarchy of objects is an important function in the human brain's visual process.
In this paper, we present two conditions that ensure the model converges to a flat minimum with a small loss, and introduce an algorithm, named Sharpness-Aware Gradient Matching (SAGM), that meets both conditions to improve model generalization.
Research into Cross-Domain Few-Shot (CDFS) has emerged to address this issue, forming a more challenging and realistic setting.
Prior work usually requires specific guidance such as the flickering frequency, manual annotations, or extra consistent videos to remove the flicker.
To this end, we construct a large-scale, multi-reference super-resolution dataset, named LMR.
The ability to discover abstract physical concepts and understand how they work in the world through observation lies at the core of human intelligence.
Pairwise learning strategies are prevalent for optimizing recommendation models on implicit feedback data; they usually learn user preference by discriminating between positive items (i.e., clicked by a user) and negative items (i.e., obtained by negative sampling).
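A standard instance of such a pairwise objective is the BPR loss, which pushes each positive item's score above its sampled negative's; this is a generic sketch of that family, not the specific model from the entry above.

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    """Bayesian Personalized Ranking loss: -log(sigmoid(pos - neg)),
    averaged over sampled (positive, negative) item pairs."""
    diff = pos_scores - neg_scores
    return float(np.mean(-np.log(1.0 / (1.0 + np.exp(-diff)))))

# Positives scored well above their negatives -> small loss
low = bpr_loss(np.array([3.0, 2.5]), np.array([-1.0, -0.5]))
# Positives tied with negatives -> loss equals log(2)
tied = bpr_loss(np.array([1.0]), np.array([1.0]))
```

Minimizing this loss only constrains the *order* of positive and negative scores, which is why pairwise strategies suit implicit feedback, where absolute ratings are unavailable.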
The transformation of features from 2D perspective space to 3D space is essential to multi-view 3D object detection.
Without introducing any external supervision and human priors, the proposed FPR effectively suppresses wrong activations from the background objects.
Moreover, we find that the image feature maps' resolution in the cross-attention module has a limited effect on the final performance.
Surrogate gradient (SG) is one of the most effective approaches for training spiking neural networks (SNNs).
Second, we train image-to-image translation networks on the synthesized datasets, enabling semantic-conditional image synthesis without human annotations.
The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset.
Ranked #4 on 3D Object Detection on Rope3D
While previous state-of-the-art RefSR methods mainly focus on improving the efficacy and robustness of reference feature transfer, it is generally overlooked that a well-reconstructed SR image should, when used as a reference, enable better SR reconstruction for similar LR images.
In this paper, we propose 4D unsupervised object discovery, jointly discovering objects from 4D data -- 3D point clouds and 2D RGB images with temporal information.
To address this limitation, we present MemoNav, a novel memory mechanism for image-goal navigation that retains the agent's informative short-term and long-term memory to improve navigation performance on multi-goal tasks.
To this end, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
Consequently, these methods use only a small number of projection constraints and produce insufficient depth candidates, leading to inaccurate depth estimation.
Specifically, we generate support samples from actual samples and their neighbouring clusters in the embedding space through a progressive linear interpolation (PLI) strategy.
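The interpolation step above can be sketched as follows; the step sizes (`alphas`) and function name are assumptions for illustration, not the paper's actual PLI schedule.

```python
import numpy as np

def support_samples(anchor, cluster_centers, alphas=(0.25, 0.5, 0.75)):
    """Generate support samples on the segment between an actual embedding
    (anchor) and each neighbouring cluster center, at progressively
    larger interpolation steps."""
    samples = []
    for center in cluster_centers:
        for a in alphas:
            samples.append((1 - a) * anchor + a * center)
    return np.stack(samples)

anchor = np.zeros(2)                    # an actual sample's embedding
centers = [np.array([4.0, 0.0])]        # one neighbouring cluster center
print(support_samples(anchor, centers)) # points at [1,0], [2,0], [3,0]
```

Sampling progressively along the segment densifies the region between a sample and its neighbouring clusters, which is the intuition behind interpolation-based support samples.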
Sound source localization in visual scenes aims to localize objects emitting the sound in a given image.
In this paper, we propose a conceptually novel, efficient, and fully convolutional framework for real-time instance segmentation.
Ranked #8 on Real-time Instance Segmentation on MSCOCO
Capsule networks are designed to represent objects by a set of parts and their relationships, which provides insight into the process of visual perception.
However, due to high training costs and unawareness of downstream usage, most self-supervised learning methods cannot accommodate the diversity of downstream scenarios, which span various data domains, vision tasks, and latency constraints on models.
Inspired by recent progress on the Vision Transformer (ViT) and Swin Transformer, we find that combining the local-aware attention mechanism with global-related feature learning meets this expectation for image compression.
Ranked #1 on Image Compression on kodak
Moreover, through experiments we show that discrete language representation has several advantages compared with continuous feature representation, from the aspects of interpretability, generalization, and robustness.
Existing methods usually generate pseudo labels from class activation map (CAM) and then train a segmentation model.
To remedy this problem, we propose an interesting and challenging cross-domain few-shot semantic segmentation task, where the training and test tasks are drawn from different domains.
Deep stereo models have achieved state-of-the-art performance on driving scenes, but they suffer from severe performance degradation when tested on unseen scenes.
In LiDAR-based 3D object detection for autonomous driving, the ratio of the object size to input scene size is significantly smaller compared to 2D detection cases.
Ranked #3 on 3D Object Detection on waymo cyclist
We employ a simple Kalman filter for trajectory prediction and preserve the tracklet by prediction when the target is not visible.
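The "simple Kalman filter" coasting behavior described above amounts to repeating the prediction step while no observation arrives; this minimal constant-velocity sketch (1-D state, assumed noise scale `q`) illustrates the idea rather than the tracker's exact filter.

```python
import numpy as np

def predict(state, cov, dt=1.0, q=1e-2):
    """One constant-velocity Kalman prediction step for state [x, v].
    Used to coast a tracklet forward while the target is not visible."""
    F = np.array([[1.0, dt],
                  [0.0, 1.0]])          # constant-velocity transition
    Q = q * np.eye(2)                   # process noise (assumed scale)
    return F @ state, F @ cov @ F.T + Q

state = np.array([0.0, 2.0])            # position 0, velocity 2
cov = np.eye(2)
for _ in range(3):                      # 3 frames without observations
    state, cov = predict(state, cov)
print(state)                            # position coasts to 6, velocity stays 2
```

Note that the covariance grows with every unobserved frame, which is what eventually justifies terminating a tracklet that has been invisible too long.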
In this paper, we work on object dynamics and propose the Object Dynamics Distillation Network (ODDN), a framework that distills explicit object dynamics (e.g., velocity) from sequential static representations.
This is the first work to use negative pseudo labels during self-training for domain adaptation.
Transfer learning with pre-training on large-scale datasets has played an increasingly significant role in computer vision and natural language processing recently.
Inpainting arbitrary missing regions is challenging because learning valid features for various masked regions is nontrivial.
Ranked #4 on Image Inpainting on CelebA-HQ
A practical long-term tracker typically contains three key properties, i.e., an efficient model design, an effective global re-detection strategy, and a robust distractor-awareness mechanism.
In this work, we propose a new method called RefineMask for high-quality instance segmentation of objects and scenes, which incorporates fine-grained features during the instance-wise segmenting process in a multi-stage manner.
Tremendous efforts have been made on instance segmentation but the mask quality is still not satisfactory.
Our motivation is that regressing keypoint positions accurately needs to learn representations that focus on the keypoint regions.
Then the association problem turns into a general graph matching between tracklet graph and detection graph.
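One simple way to solve such an assignment between tracklet and detection nodes is greedy matching on an affinity matrix; the paper casts this as general graph matching, so the greedy solver and threshold below are only an illustrative stand-in.

```python
import numpy as np

def greedy_match(affinity, thresh=0.3):
    """Greedily match tracklet nodes (rows) to detection nodes (columns),
    repeatedly taking the highest remaining affinity above a threshold."""
    A = affinity.astype(float)          # working copy we can invalidate
    matches = []
    while True:
        i, j = np.unravel_index(np.argmax(A), A.shape)
        if A[i, j] < thresh:            # nothing confident left to match
            break
        matches.append((int(i), int(j)))
        A[i, :] = -np.inf               # each tracklet matched at most once
        A[:, j] = -np.inf               # each detection matched at most once
    return matches

aff = np.array([[0.9, 0.1],
                [0.2, 0.8]])
print(greedy_match(aff))                # [(0, 0), (1, 1)]
```

A full graph-matching formulation additionally scores pairwise edge consistency between the two graphs, which greedy node-wise assignment ignores.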
The most notable difference from previous works is that our method is purely based on the range-view representation.
We first analyze the existing range-view-based methods and find two issues overlooked by previous works: 1) the scale variation between nearby and far away objects; 2) the inconsistency between the 2D range image coordinates used in feature extraction and the 3D Cartesian coordinates used in output.
Unsupervised domain adaptation for semantic segmentation aims to assign pixel-level labels to the unlabeled target domain by transferring knowledge from the labeled source domain.
This work argues that these approaches are in fact not aware of the clothing status (i.e., changed or unchanged) of a pedestrian.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths, which can be used for training more accurate segmentation models.
Ranked #29 on Weakly-Supervised Semantic Segmentation on COCO 2014 val (using extra training data)
We further identify another major issue, seldom noticed by the community, that the long-tailed and open-ended (sub-)category distribution should be accommodated.
In this paper, we propose a manual-label-free 3D detection algorithm that leverages the CARLA simulator to generate a large number of self-labeled training samples and introduces a novel Domain Adaptive VoxelNet (DA-VoxelNet) that can bridge the distribution gap from synthetic data to the real scenario.
We name the proposed 3D shape search engine, which combines GPU acceleration and Inverted File Twice, as GIFT.