We propose a metric, recall of best-regressed samples, to quantitatively evaluate the misalignment problem.
Specifically, we propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
The VLM combines the information in the generated visual prompts with the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions, which are domain-invariant across tracking scenes.
Current strategies use a decoupled approach of single-step retrosynthesis models and search algorithms, taking only the product as the input to predict the reactants for each planning step and ignoring valuable context information along the synthetic route.
However, the analysis of the implicit denoising effect in graph neural networks remains an open problem.
Many point-based 3D detectors adopt point-feature sampling strategies to drop some points for efficient inference.
In this paper, we explore the performance of real-time models on this metric and endow the models with the capacity to predict the future, significantly improving the results for streaming perception.
To date, the most powerful semi-supervised object detectors (SS-OD) are based on pseudo-boxes, which require a sequence of post-processing steps with fine-tuned hyper-parameters.
In this paper, instead of searching trade-offs between accuracy and speed like previous works, we point out that endowing real-time models with the ability to predict the future is the key to dealing with this problem.
Ranked #1 on Real-Time Object Detection on Argoverse-HD (Full-Stack, Val) (sAP metric, using extra training data)
To address this, we propose a simple and efficient data augmentation strategy, local augmentation, to learn the distribution of the node features of the neighbors conditioned on the central node's feature and enhance GNN's expressive power with generated features.
In this report, we introduce our real-time 2D object detection system for the realistic autonomous driving scenario.
In this report, we present several empirical improvements to the YOLO series, forming a new high-performance detector, YOLOX.
Ranked #1 on Real-Time Object Detection on Argoverse-HD (Detection-Only, Val) (using extra training data)
Recent advances in label assignment in object detection mainly seek to independently define positive/negative training samples for each ground-truth (gt) object.
Ranked #73 on Object Detection on COCO test-dev
The teacher's weights are a momentum update of the student's, and the teacher's BN statistics are a momentum update of the historical statistics.
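A momentum update of this kind is commonly written as an exponential moving average (EMA). The following is a minimal sketch, not the authors' implementation: the function name `ema_update`, the dictionary layout, and the momentum value are illustrative assumptions, and it assumes the BN statistics are averaged with the same rule as the weights.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return updated teacher values as an exponential moving average.

    Each entry becomes momentum * teacher + (1 - momentum) * student.
    Both arguments are plain dicts mapping a (hypothetical) parameter
    name to a float; a real model would hold tensors instead.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


# Hypothetical values: one weight and one BN running mean.
teacher = {"w": 1.0, "bn_mean": 0.0}
student = {"w": 0.0, "bn_mean": 2.0}

# One update step with an assumed momentum of 0.9.
teacher = ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1, the teacher changes slowly and smooths out step-to-step noise in the student.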
A joint loss, defined as the weighted sum of the classification (cls) and regression (reg) losses, then serves as the assignment indicator.
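As a rough illustration of using a joint loss as an assignment indicator, the sketch below computes a per-candidate cost as a weighted sum of cls and reg losses. The selection rule (keeping the `k` lowest-cost candidates as positives), the weight `reg_weight`, and all names and numbers are assumptions for illustration; the source does not specify them.

```python
def joint_cost(cls_loss, reg_loss, reg_weight=1.0):
    """Weighted sum of per-candidate classification and regression losses."""
    return [c + reg_weight * r for c, r in zip(cls_loss, reg_loss)]


def pick_positives(costs, k):
    """Return the indices of the k lowest-cost candidates, in ascending index order.

    Assumption: top-k selection is one plausible way to turn the joint
    cost into positive/negative labels; other rules are possible.
    """
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    return sorted(order[:k])


# Hypothetical losses for four candidate anchors against one ground-truth box.
cls_loss = [0.9, 0.2, 0.4, 1.5]
reg_loss = [0.5, 0.1, 0.6, 0.2]

costs = joint_cost(cls_loss, reg_loss, reg_weight=2.0)
positives = pick_positives(costs, k=2)
```

Ranking on the joint cost favors candidates that are good at both classification and localization, rather than at only one of the two.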
Our Faster R-CNN (ResNet50-FPN) baseline achieves 39.8% mAP on COCO, which is on par with the state-of-the-art self-supervised methods pre-trained on ImageNet.
In this paper, we propose a simple and efficient operator called Border-Align to extract "border features" from the extreme point of the border to enhance the point feature.
Few-shot object detection (FSOD) helps detectors adapt to unseen classes with few training instances, and is useful when manual annotation is time-consuming or data acquisition is limited.
Ranked #16 on Few-Shot Object Detection on MS-COCO (30-shot)
During training, to both satisfy the prior distribution of data and adapt to category characteristics, we present Center Weighting to adjust the category-specific prior distributions.
Thanks to this coarse-to-fine feature adaptation, domain knowledge in foreground regions can be effectively transferred.
Pyramidal feature representation is the common practice to address the challenge of scale variation in object detection.
Ranked #148 on Object Detection on COCO test-dev
The Graph Convolutional Network (GCN) has been recognized as one of the most effective graph models for semi-supervised learning, but it extracts only first-order or few-order neighborhood information through information propagation and suffers a performance drop-off with deeper structures.
Current top-performing object detectors depend on deep CNN backbones, such as ResNet-101 and Inception, benefiting from their powerful feature representations but suffering from high computational costs.