LV-ViT is a vision transformer that uses token labeling as a training objective. Unlike the standard ViT objective, which computes the classification loss only on an additional trainable class token, token labeling leverages all image patch tokens to compute the training loss in a dense manner. Specifically, it reformulates image classification as multiple token-level recognition problems and assigns each patch token an individual, location-specific supervision signal generated by a machine annotator.
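The dense objective above can be sketched as a weighted sum of the usual class-token cross-entropy and an auxiliary per-token cross-entropy against the machine-annotated soft labels. This is a minimal, framework-free sketch; the function name `token_labeling_loss` and the balancing weight `beta` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def soft_cross_entropy(logits, target_probs):
    # Cross-entropy between predicted logits and a (soft) target distribution,
    # computed with a numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(target_probs * log_probs).sum(axis=-1)

def token_labeling_loss(cls_logits, token_logits, cls_target, token_targets, beta=0.5):
    """Sketch of a token-labeling-style objective.

    cls_logits:    (C,)   prediction from the class token
    token_logits:  (N, C) per-patch-token predictions
    cls_target:    (C,)   one-hot image-level label
    token_targets: (N, C) location-specific soft labels from a machine annotator
    beta:          weight on the dense auxiliary term (assumed value)
    """
    cls_loss = soft_cross_entropy(cls_logits, cls_target)
    # Dense term: average the token-level cross-entropy over all patch tokens.
    token_loss = soft_cross_entropy(token_logits, token_targets).mean()
    return cls_loss + beta * token_loss
```

With `beta=0` the loss reduces to the standard class-token cross-entropy, which makes the role of the dense auxiliary term easy to isolate.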
Source: All Tokens Matter: Token Labeling for Training Better Vision Transformers
Task | Papers | Share
---|---|---
Image Classification | 3 | 27.27%
Efficient ViTs | 2 | 18.18%
Computational Efficiency | 1 | 9.09%
Token Reduction | 1 | 9.09%
Analogical Similarity | 1 | 9.09%
Action Recognition | 1 | 9.09%
General Classification | 1 | 9.09%
Semantic Segmentation | 1 | 9.09%