In this paper, instead of searching for trade-offs between accuracy and speed as in previous works, we point out that endowing real-time models with the ability to predict the future is the key to dealing with this problem.
Ranked #1 on Real-Time Object Detection on Argoverse-HD (Full-Stack, Val) (sAP metric, using extra training data)
Deep neural networks perform poorly on heavily class-imbalanced datasets.
Specifically, we propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
In particular, Panoptic FCN encodes each object instance or stuff category with the proposed kernel generator and produces the prediction by convolving the high-resolution feature directly.
In this report, we introduce our real-time 2D object detection system for the realistic autonomous driving scenario.
In this report, we present some experienced improvements to YOLO series, forming a new high-performance detector -- YOLOX.
Ranked #1 on Real-Time Object Detection on Argoverse-HD (Detection-Only, Val) (using extra training data)
Motivated by our discovery, we propose a unified distribution alignment strategy for long-tail visual recognition.
Ranked #12 on Long-tail Learning on Places-LT
Recent advances in label assignment in object detection mainly seek to independently define positive/negative training samples for each ground-truth (gt) object.
Ranked #48 on Object Detection on COCO test-dev
The teacher's weights are a momentum update of the student's, and the teacher's BN statistics are a momentum update of those in history.
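The momentum (exponential moving average) update described above can be sketched as follows; this is a minimal illustration of the rule, not the authors' implementation, and the function name and momentum value are assumptions:

```python
# Hedged sketch of a momentum (EMA) teacher update: the teacher's parameters
# track a moving average of the student's. The name `ema_update` and the
# momentum value are illustrative, not taken from the paper.

def ema_update(teacher_params, student_params, m=0.5):
    """Element-wise teacher <- m * teacher + (1 - m) * student."""
    return [m * t + (1.0 - m) * s for t, s in zip(teacher_params, student_params)]

# The same momentum rule would apply to BN running statistics: the teacher's
# running mean/variance is a momentum average of the historical statistics.
teacher = [1.0, 0.0]
student = [0.0, 1.0]
teacher = ema_update(teacher, student, m=0.5)
print(teacher)  # [0.5, 0.5]
```

With a momentum close to 1 (e.g. 0.999 in many EMA-teacher setups), the teacher evolves slowly and smooths out the student's training noise.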
The Learnable Tree Filter presents a remarkable approach to model structure-preserving relations for semantic segmentation.
To this end, we propose a fine-grained dynamic head to conditionally select a pixel-level combination of FPN features from different scales for each instance, which further unleashes the multi-scale feature representation ability.
In this paper, we present a conceptually simple, strong, and efficient framework for panoptic segmentation, called Panoptic FCN.
Ranked #1 on Panoptic Segmentation on Cityscapes val (PQst metric)
Our Faster R-CNN (ResNet50-FPN) baseline achieves 39.8% mAP on COCO, which is on par with the state-of-the-art self-supervised methods pre-trained on ImageNet.
In this report, we present our object detection/instance segmentation system, MegDetV2, which works in a two-pass fashion: first detecting instances, then obtaining their segmentation.
In this paper, we propose a method, named EqCo (Equivalent Rules for Contrastive Learning), to make self-supervised learning irrelevant to the number of negative samples in InfoNCE-based contrastive learning frameworks.
In this paper, we propose a simple and efficient operator called BorderAlign to extract "border features" from the extreme points of the border to enhance the point feature.
During training, to both satisfy the prior distribution of data and adapt to category characteristics, we present Center Weighting to adjust the category-specific prior distributions.
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate the scale variation challenge in object detection.
To demonstrate the superiority of the dynamic property, we compare with several static architectures, which can be modeled as special cases in the routing space.
To this end, tree filtering modules are embedded to formulate a unified framework for semantic segmentation.
This report presents our method, which won the nuScenes 3D Detection Challenge held at the Workshop on Autonomous Driving (WAD, CVPR 2019).
Ranked #171 on 3D Object Detection on nuScenes
In this paper, we investigate the effectiveness of two-stage detectors in real-time generic detection and propose a lightweight two-stage detector named ThunderNet.
Ranked #13 on Object Detection on PASCAL VOC 2007
Recent object detectors like FPN and RetinaNet usually involve extra stages beyond those used for image classification to handle objects at various scales.
Due to the gap between image classification and object detection, we propose DetNet in this paper, a novel backbone network specifically designed for object detection.
More importantly, by simply replacing the backbone with a tiny network (e.g., Xception), our Light-Head R-CNN achieves 30.7 mmAP at 102 FPS on COCO, significantly outperforming single-stage, fast detectors like YOLO and SSD in both speed and accuracy.
The improvements in recent CNN-based object detection works, from R-CNN and Fast/Faster R-CNN [10, 31] to the recent Mask R-CNN and RetinaNet, mainly come from new networks, new frameworks, or novel loss designs.