Visual Prompt Tuning
19 papers with code • 4 benchmarks • 0 datasets
Visual Prompt Tuning (VPT) introduces only a small number of task-specific learnable parameters into the input space while keeping the entire pre-trained Transformer backbone frozen during downstream training. In practice, these additional parameters are simply prepended to the input sequence of each Transformer layer and learned together with a linear head during fine-tuning. VPT is especially effective in the low-data regime and maintains its advantage across data scales. It is also competitive across a range of Transformer scales and designs (ViT-Base/Large/Huge, Swin). Taken together, these results suggest that VPT is one of the most effective ways of adapting ever-growing vision backbones.
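The mechanism is easiest to see in code. Below is a minimal, hypothetical PyTorch sketch of the deep variant, assuming `blocks` is an `nn.ModuleList` of pre-trained Transformer blocks that each map a `(B, L, D)` token sequence to the same shape; the class name and defaults are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class VPTDeep(nn.Module):
    """Sketch of deep visual prompt tuning: fresh learnable prompts are
    prepended to the token sequence at every Transformer layer, while the
    pre-trained backbone stays frozen. Only prompts and head are trained."""

    def __init__(self, blocks, embed_dim=768, num_prompts=10, num_classes=100):
        super().__init__()
        self.blocks = blocks  # assumed: nn.ModuleList of pre-trained blocks
        for p in self.blocks.parameters():
            p.requires_grad = False  # freeze the entire backbone
        # one set of task-specific prompts per layer
        self.prompts = nn.Parameter(
            torch.randn(len(blocks), num_prompts, embed_dim) * 0.02
        )
        self.num_prompts = num_prompts
        self.head = nn.Linear(embed_dim, num_classes)  # learned linear head

    def forward(self, tokens):
        # tokens: (B, 1 + N, D) -- [CLS] token followed by patch embeddings
        B = tokens.size(0)
        for i, block in enumerate(self.blocks):
            prompts = self.prompts[i].unsqueeze(0).expand(B, -1, -1)
            # discard the previous layer's prompt outputs and prepend fresh
            # prompts, keeping [CLS] first in the sequence layout
            cls_tok = tokens[:, :1]
            rest = tokens[:, 1 + (self.num_prompts if i > 0 else 0):]
            tokens = torch.cat([cls_tok, prompts, rest], dim=1)
            tokens = block(tokens)  # frozen pre-trained layer
        return self.head(tokens[:, 0])  # classify from the [CLS] output
```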
Most implemented papers
E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning
Specifically, we introduce a set of learnable key-value prompts and visual prompts into self-attention and input layers, respectively, to improve the effectiveness of model fine-tuning.
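As a rough illustration of what key-value prompts mean, here is a hedged PyTorch sketch of a self-attention layer where small learnable vectors are concatenated to the keys and values; the module name, initialization, and prompt count are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVPromptedAttention(nn.Module):
    """Self-attention with learnable key-value prompts, in the spirit of
    E^2VPT: learnable vectors are appended to the frozen layer's keys and
    values, so only the prompts receive gradients. A sketch, not the
    reference implementation."""

    def __init__(self, embed_dim=768, num_heads=12, num_kv_prompts=5):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)  # frozen in practice
        self.proj = nn.Linear(embed_dim, embed_dim)     # frozen in practice
        # learnable prompts appended to each head's keys and values
        self.k_prompt = nn.Parameter(
            torch.randn(num_heads, num_kv_prompts, self.head_dim) * 0.02)
        self.v_prompt = nn.Parameter(
            torch.randn(num_heads, num_kv_prompts, self.head_dim) * 0.02)

    def forward(self, x):
        B, L, D = x.shape
        qkv = self.qkv(x).reshape(B, L, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, H, L, d)
        k = torch.cat([self.k_prompt.unsqueeze(0).expand(B, -1, -1, -1), k], dim=2)
        v = torch.cat([self.v_prompt.unsqueeze(0).expand(B, -1, -1, -1), v], dim=2)
        attn = F.scaled_dot_product_attention(q, k, v)  # (B, H, L, d)
        return self.proj(attn.transpose(1, 2).reshape(B, L, D))
```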
Online Class Incremental Learning on Stochastic Blurry Task Boundary via Mask and Visual Prompt Tuning
In addition, to alleviate the class imbalance problem, we introduce a new gradient similarity-based focal loss and adaptive feature scaling to ease overfitting to the major classes and underfitting to the minor classes.
Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning
Existing Generalized FL (GFL) and Personalized FL (PFL) methods have limitations in balancing performance across both global and local data distributions.
TSP-Transformer: Task-Specific Prompts Boosted Transformer for Holistic Scene Understanding
Holistic scene understanding includes semantic segmentation, surface normal estimation, object boundary detection, depth estimation, etc.
SA^2VP: Spatially Aligned-and-Adapted Visual Prompt
Typical methods for visual prompt tuning follow the sequential modeling paradigm inherited from NLP: an input image is represented as a flattened sequence of token embeddings, and a set of unordered parameterized tokens prefixed to this sequence is learned as the visual prompts for adapting large vision models to the task.
Revisiting the Power of Prompt for Visual Tuning
Inspired by the observation that the prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes.
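To make the idea of prototype-based initialization concrete, here is a hedged sketch under strong assumptions: `model.embed(images)` is a hypothetical method returning `(B, N, D)` patch tokens from the frozen backbone, `model.prompts` a `(num_prompts, D)` parameter, and the loader yields `(image, label)` batches; the k-means pooling is one plausible way to form prototypes, not necessarily the paper's procedure.

```python
import torch

@torch.no_grad()
def init_prompts_from_prototypes(model, loader, num_prompts, device="cpu"):
    """Hypothetical helper: run the frozen backbone over downstream images,
    pool patch tokens into prototypes, and copy them into the prompt
    parameters as an initialization."""
    feats = []
    for images, _ in loader:                        # assumed (image, label) batches
        tokens = model.embed(images.to(device))     # (B, N, D) patch tokens
        feats.append(tokens.mean(dim=1))            # image-level mean token
    feats = torch.cat(feats)                        # (num_images, D)
    # simple k-means to distill num_prompts prototypes from the features
    centers = feats[torch.randperm(len(feats))[:num_prompts]].clone()
    for _ in range(10):
        assign = torch.cdist(feats, centers).argmin(dim=1)
        for j in range(num_prompts):
            members = feats[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(dim=0)
    model.prompts.data.copy_(centers)               # prototype initialization
```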
CoLLaVO: Crayon Large Language and Vision mOdel
Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks.
CoDA: Instructive Chain-of-Domain Adaptation with Severity-Aware Visual Prompt Tuning
SAVPT features a novel metric, Severity, that divides all adverse-scene images into low-severity and high-severity groups.
ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning
Panoptic segmentation, combining semantic and instance segmentation, stands as a cutting-edge computer vision task.