DETR, or DEtection TRansformer, is a set-based object detector that uses a Transformer on top of a convolutional backbone. A conventional CNN backbone learns a 2D representation of the input image; the model flattens this feature map and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small, fixed number of learned positional embeddings, called object queries, and additionally attends to the encoder output. Each output embedding of the decoder is passed to a shared feed-forward network (FFN) that predicts either a detection (class and bounding box) or a “no object” class.
Source: End-to-End Object Detection with Transformers
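Below is a minimal sketch of this pipeline in PyTorch, modeled on the short demo listing included in the DETR paper: a ResNet-50 backbone, a 1×1 convolution to reduce channel depth, learned 2D positional encodings, a standard transformer, and shared class/box prediction heads. The hyperparameter values (hidden size 256, 100 object queries, 6 encoder/decoder layers) follow the paper's defaults; Hungarian matching and the training losses are omitted.

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """Sketch of the DETR forward pass: CNN backbone -> transformer -> FFN heads.
    Inference-only; set prediction losses and bipartite matching are omitted."""
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        # Conventional CNN backbone (ResNet-50 without its pooling/fc head).
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 convolution reduces the 2048 backbone channels to hidden_dim.
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        # Learned positional embeddings fed to the decoder: the object queries.
        self.query_pos = nn.Parameter(torch.rand(num_queries, hidden_dim))
        # Learned 2D positional encodings for the encoder (half per axis).
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        # Shared prediction heads: class logits (+1 for "no object") and box.
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)

    def forward(self, x):
        h = self.conv(self.backbone(x))            # (B, hidden_dim, H, W)
        H, W = h.shape[-2:]
        # Build the 2D positional encoding and flatten it to (H*W, 1, hidden_dim).
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        # Flatten the feature map, add positions, and decode the object queries
        # (batch size 1 in this sketch, matching the paper's demo listing).
        h = self.transformer(pos + h.flatten(2).permute(2, 0, 1),
                             self.query_pos.unsqueeze(1)).transpose(0, 1)
        return self.linear_class(h), self.linear_bbox(h).sigmoid()

detr = MinimalDETR(num_classes=91)
logits, boxes = detr(torch.randn(1, 3, 800, 1200))  # 100 predictions per image
```

Each of the 100 queries yields one class distribution and one normalized box, so post-processing reduces to dropping predictions whose most likely class is “no object”; there is no anchor generation or non-maximum suppression.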
| Task | Papers | Share |
|---|---|---|
| Object Detection | 88 | 38.10% |
| Semantic Segmentation | 9 | 3.90% |
| Instance Segmentation | 9 | 3.90% |
| Image Classification | 6 | 2.60% |
| Few-Shot Object Detection | 5 | 2.16% |
| Panoptic Segmentation | 5 | 2.16% |
| Semi-Supervised Object Detection | 4 | 1.73% |
| Self-Supervised Learning | 4 | 1.73% |
| Real-Time Object Detection | 3 | 1.30% |
| Component | Type |
|---|---|
| Convolutions | |
| Feedforward Networks | |
| Convolutional Neural Networks | (optional) |
| Transformers | |