no code implementations • • Yahui Long, Kok Siong Ang, Mengwei Li, Kian Long Kelvin Chong, Raman Sethi, Chengwei Zhong, Hang Xu, Zhiwei Ong, Karishma Sachaphibulkij, Ao Chen, Zeng Li, Huazhu Fu, Min Wu, Hsiu Kim Lina Lim, Longqi Liu, Jinmiao Chen
Lastly, compared to other methods, GraphST’s cell type deconvolution achieved higher accuracy on simulated data and better captured spatial niches such as the germinal centers of the lymph node in experimentally acquired data.
We thus propose a Query Contrast mechanism to explicitly enhance queries towards their best-matched GTs over all unmatched query predictions.
Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e. g., zero-shot classification, retrieval and image captioning.
Here, we make the first attempt to achieve generic text-guided cross-category 3D object generation via a new 3D-TOGO model, which integrates a text-to-views generation module and a views-to-3D generation module.
In this work, we focus on learning a VLP model with sequential chunks of image-text pair data.
We further design a concept dictionary~(with descriptions) from various online sources and detection datasets to provide prior knowledge for each concept.
Aiming towards a holistic understanding of multiple downstream tasks simultaneously, there is a need for extracting features with better transferability.
For this problem, we propose the Explainable Contrastive Language-Image Pre-training (ECLIP), which corrects the explainability via the Masked Max Pooling.
Self-supervised depth learning from monocular images normally relies on the 2D pixel-wise photometric relation between temporally adjacent image frames.
This is because most of the existing lane detection methods either treat the lane detection as a dense prediction or a detection task, few of them consider the unique topologies (Y-shape, Fork-shape, nearly horizontal lane) of the lane markers, which leads to sub-optimal solution.
To bridge the gap between supervised semantic segmentation and real-world applications that acquires one model to recognize arbitrary new concepts, recent zero-shot segmentation attracts a lot of attention by exploring the relationships between unseen and seen object categories, yet requiring large amounts of densely-annotated data with diverse base classes.
Crucially, with a linear complexity, much longer token sequences are permitted in SOFT, resulting in superior trade-off between accuracy and complexity.
In this paper, we propose CO^3, namely Cooperative Contrastive Learning and Contextual Shape Prediction, to learn 3D representation for outdoor-scene point clouds in an unsupervised manner.
A self-driving perception model aims to extract 3D semantic representations from multiple cameras collectively into the bird's-eye-view (BEV) coordinate frame of the ego car in order to ground downstream planner.
On the other hand, for existing SSL methods, it is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
To address this problem, we propose a noise-robust bi-level re-weighting framework which is able to learn the per-sample weights measuring the data quality without requiring any gold data.
Accurate and reliable 3D detection is vital for many applications including autonomous driving vehicles and service robots.
To make ROSETTA automatically determine which experience is available and useful, a prototypical task correlation guided Gating Diversity Controller(GDC) is introduced to adaptively adjust the diversity of gates for the new task based on class-specific prototypes.
We present ONCE-3DLanes, a real-world autonomous driving dataset with lane layout annotation in 3D space.
It is able to find top 0. 16\% and 0. 29\% architectures on average on two search spaces under the budget of only 50 models.
Existing text-guided image manipulation methods aim to modify the appearance of the image or to edit a few objects in a virtual or simple scenario, which is far from practical application.
We further propose a lightweight scene-to-sequence decoder that can auto-regressively generate words conditioned on features from a 3D scene as well as cues from the preceding words.
We present Laneformer, a conceptually simple yet powerful transformer-based architecture tailored for lane detection that is a long-standing research topic for visual perception in autonomous driving.
no code implementations • 15 Mar 2022 • Kaican Li, Kai Chen, Haoyu Wang, Lanqing Hong, Chaoqiang Ye, Jianhua Han, Yukuai Chen, Wei zhang, Chunjing Xu, Dit-yan Yeung, Xiaodan Liang, Zhenguo Li, Hang Xu
One main reason that impedes the development of truly reliably self-driving systems is the lack of public datasets for evaluating the performance of object detectors on corner cases.
To improve the ability of fast cross-domain adaptation, we propose Prompt-based Environmental Self-exploration (ProbES), which can self-explore the environments by sampling trajectories and automatically generates structured instructions via a large-scale cross-modal pretrained model (CLIP).
Recently over-smoothing phenomenon of Transformer-based models is observed in both vision and language fields.
There is a growing interest in dataset generation recently due to the superior generative capacity of large pre-trained language models (PLMs).
Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods.
Ranked #6 on Zero-shot Image Retrieval on COCO-CN
First, 3D-CoCo is inspired by our observation that the bird-eye-view (BEV) features are more transferable than low-level geometry features.
In this paper, we introduce a large-scale Fine-grained Interactive Language-Image Pre-training (FILIP) to achieve finer-level alignment through a cross-modal late interaction mechanism, which uses a token-wise maximum similarity between visual and textual tokens to guide the contrastive objective.
Crucially, with a linear complexity, much longer token sequences are permitted in SOFT, resulting in superior trade-off between accuracy and complexity.
In this paper, we have a critical insight that improving the feed-forward network (FFN) in BERT has a higher gain than improving the multi-head attention (MHA) since the computational cost of FFN is 2$\sim$3 times larger than MHA.
To resolve the problems, we propose a novel second-stage module, named pyramid RoI head, to adaptively learn the features from the sparse points of interest.
Ranked #2 on 3D Object Detection on waymo vehicle (AP metric)
We present Voxel Transformer (VoTr), a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds.
Ranked #3 on 3D Object Detection on waymo vehicle (L1 mAP metric)
Extensive Unsupervised Domain Adaptation (UDA) studies have shown great success in practice by learning transferable representations across a labeled source domain and an unlabeled target domain with deep models.
By pre-training on SODA10M, a large-scale autonomous driving dataset, MultiSiam exceeds the ImageNet pre-trained MoCo-v2, demonstrating the potential of domain-specific pre-training.
In this paper, we first identify that spherical rectangles are unbiased bounding boxes for objects in spherical images, and then propose an analytical method for IoU calculation without any approximations.
In this paper, we investigate the knowledge distillation (KD) strategy for object detection and propose an effective framework applicable to both homogeneous and heterogeneous student-teacher pairs.
Fine-tuning from pre-trained ImageNet models has been a simple, effective, and popular approach for various computer vision tasks.
In this paper, we investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval among fine-grained product categories.
Transformer-based pre-trained language models like BERT and its variants have recently achieved promising performance in various natural language processing (NLP) tasks.
Ranked #10 on Semantic Textual Similarity on MRPC
Experiments show that SODA10M can serve as a promising pre-training dataset for different self-supervised learning methods, which gives superior performance when fine-tuning with different downstream tasks (i. e., detection, semantic/instance segmentation) in autonomous driving domain.
To facilitate future research on exploiting unlabeled data for 3D detection, we additionally provide a benchmark in which we reproduce and evaluate a variety of self-supervised and semi-supervised methods on the ONCE dataset.
For student morphism, weight inheritance strategy is adopted, allowing the student to flexibly update its architecture while fully utilize the predecessor's weights, which considerably accelerates the search; To facilitate dynamic distillation, an elastic teacher pool is trained via integrated progressive shrinking strategy, from which teacher detectors can be sampled without additional cost in subsequent searches.
While existing NAS methods mostly design architectures on a single task, algorithms that look beyond single-task search are surging to pursue a more efficient and universal solution across various tasks.
DeepReduce is orthogonal to existing gradient sparsifiers and can be applied in conjunction with them, transparently to the end-user, to significantly lower the communication overhead.
To address this problem, we develop a probability-based pruning algorithm, called batch whitening channel pruning (BWCP), which can stochastically discard unimportant channels by modeling the probability of a channel being activated.
Unsupervised pre-training aims at learning transferable features that are beneficial for downstream tasks.
A surprising result is that diagonal elements in the attention map are the least important compared with other attention positions.
For object detection, the well-established classification and regression loss functions have been carefully designed by considering diverse learning challenges.
Unlike most recent methods that focused on improving accuracy of image classification, we present a novel contrastive learning approach, named DetCo, which fully explores the contrasts between global image and local image patches to learn discriminative representations for object detection.
This paper introduces DeepReduce, a versatile framework for the compressed communication of sparse tensors, tailored for distributed deep learning.
This work presents a new fine-grained transparent object segmentation dataset, termed Trans10K-v2, extending Trans10K-v1, the first large-scale transparent object segmentation dataset.
Ranked #3 on Semantic Segmentation on Trans10K
While existing NAS methods mostly design architectures on one single task, algorithms that look beyond single-task search are surging to pursue a more efficient and universal solution across various tasks.
Here we present a novel self-supervised 3D Object detection framework that seamlessly integrates the geometry-aware contrast and clustering harmonization to lift the unsupervised 3D representation learning, named GCC-3D.
The semi-supervised semantic segmentation methods utilize the unlabeled data to increase the feature discriminative ability to alleviate the burden of the annotated data.
The resulting model zoo is more training efficient than SOTA NAS models, e. g. 6x faster than RegNetY-16GF, and 1. 7x faster than EfficientNetB3.
Owning to the unremitting efforts by a few institutes, significant progress has recently been made in designing superhuman AIs in No-limit Texas Hold'em (NLTH), the primary testbed for large-scale imperfect-information game research.
Panoptic segmentation that unifies instance segmentation and semantic segmentation has recently attracted increasing attention.
Ranked #16 on Panoptic Segmentation on COCO test-dev
1 code implementation • 3 Nov 2020 • Bochao Wang, Hang Xu, Jiajin Zhang, Chen Chen, Xiaozhi Fang, Yixing Xu, Ning Kang, Lanqing Hong, Chenhan Jiang, Xinyue Cai, Jiawei Li, Fengwei Zhou, Yong Li, Zhicheng Liu, Xinghao Chen, Kai Han, Han Shu, Dehua Song, Yunhe Wang, Wei zhang, Chunjing Xu, Zhenguo Li, Wenzhi Liu, Tong Zhang
Automated Machine Learning (AutoML) is an important industrial solution for automatic discovery and deployment of the machine learning models.
In this work, we propose an efficient, cooperative and highly automated framework to simultaneously search for all main components including backbone, segmentation branches, and feature fusion module in a unified panoptic segmentation pipeline based on the prevailing one-shot Network Architecture Search (NAS) paradigm.
For this task, we introduce a new video-based benchmark, the Driver Anomaly Detection (DAD) dataset, which contains normal driving videos together with a set of anomalous actions in its training set.
In this paper, we propose a novel lane-sensitive architecture search framework named CurveLane-NAS to automatically capture both long-ranged coherent and accurate short-range curve information while unifying both architecture search and post-processing on curve lane predictions via point blending.
Ranked #4 on Lane Detection on CurveLanes
By carefully analyzing the existing bounding box patterns on the feature hierarchy, we design a flexible and tight hyper-parameter space for anchor configurations.
In spite of its remarkable progress, many algorithms are restricted to particular search spaces.
Ranked #12 on Neural Architecture Search on NAS-Bench-201, ImageNet-16-120 (Accuracy (Val) metric)
The key ideas are two-fold: a) explicitly modeling the dependencies among joints and the relations between the pixels and the joints for better local feature representation learning; b) unifying the dense pixel-wise offset predictions and direct joint regression for end-to-end training.
Is a hand-crafted detection network tailored for natural image undoubtedly good enough over a discrepant medical lesion domain?
In this paper, we study the hybrid-supervised object detection problem, aiming to train a high quality detector with only a limited amount of fullyannotated data and fully exploiting cheap data with imagelevel labels.
Finally, an InterDomain Transfer Module is proposed to exploit diverse transfer dependencies across all domains and enhance the regional feature representation by attending and transferring semantic contexts globally.
In this paper, we present a two-stage coarse-to-fine searching strategy named Structural-to-Modular NAS (SM-NAS) for searching a GPU-friendly design of both an efficient combination of modules and better modular-level architecture for object detection.
In this work, we propose BONAS (Bayesian Optimized Neural Architecture Search), a sample-based NAS framework which is accelerated using weight-sharing to evaluate multiple related architectures simultaneously.
2 code implementations • 4 Nov 2019 • Kai Zhang, Shuhang Gu, Radu Timofte, Zheng Hui, Xiumei Wang, Xinbo Gao, Dongliang Xiong, Shuai Liu, Ruipeng Gang, Nan Nan, Chenghua Li, Xueyi Zou, Ning Kang, Zhan Wang, Hang Xu, Chaofeng Wang, Zheng Li, Lin-Lin Wang, Jun Shi, Wenyu Sun, Zhiqiang Lang, Jiangtao Nie, Wei Wei, Lei Zhang, Yazhe Niu, Peijin Zhuo, Xiangzhen Kong, Long Sun, Wenhao Wang
The challenge had 3 tracks.
Abstract Neural architecture search (NAS) has shown great potential in automating the manual process of designing a good CNN architecture for image classification.
Inspired by the nature of the graph structure of a neural network, we propose BOGCN-NAS, a NAS algorithm using Bayesian Optimization with Graph Convolutional Network (GCN) predictor.
The Neural Architecture Search (NAS) problem is typically formulated as a graph search problem where the goal is to learn the optimal operations over edges in order to maximise a graph-level global objective.
How to proper encode high-order object relation in the detection system without any external knowledge?
In this paper, we address the large-scale object detection problem with thousands of categories, which poses severe challenges due to long-tail data distributions, heavy occlusions, and class ambiguities.
The dominant object detection approaches treat the recognition of each region separately and overlook crucial semantic correlations between objects in one scene.