Phrase Grounding
36 papers with code • 5 benchmarks • 6 datasets
Given an image and a corresponding caption, the Phrase Grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image.
Source: Phrase Grounding by Soft-Label Chain Conditional Random Field
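As a rough illustration of what the task asks of a model, here is a minimal sketch of a simple two-stage baseline (not the method of any paper listed below): embed each noun phrase and each candidate image region with CLIP, then assign each phrase the highest-scoring region. The image path, phrase list, region proposals, and checkpoint are assumptions for illustration.

```python
# Minimal two-stage phrase-grounding sketch: score candidate regions
# against each noun phrase with CLIP and keep the best box per phrase.
# The image, phrases, proposals, and checkpoint are illustrative
# assumptions, not any listed paper's method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                    # assumed input image
phrases = ["a man in a red shirt", "a brown dog"]    # noun phrases from the caption
boxes = [(0, 0, 200, 300), (180, 150, 400, 320)]     # assumed region proposals (x0, y0, x1, y1)

crops = [image.crop(b) for b in boxes]
inputs = processor(text=phrases, images=crops, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# logits_per_text has shape (num_phrases, num_regions); each row scores
# one phrase against every candidate region.
best = out.logits_per_text.argmax(dim=1).tolist()
for phrase, idx in zip(phrases, best):
    print(f"{phrase!r} -> box {boxes[idx]}")
```

Dedicated grounding models, including several of the papers below, replace the fixed region proposals with end-to-end box prediction conditioned on the caption.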
Latest papers
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Large Vision Language Models have achieved fine-grained object perception, but limited image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios.
An Open and Comprehensive Pipeline for Unified Object Grounding and Detection
Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC).
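As a sketch of running phrase grounding with a Grounding DINO checkpoint, the example below uses the Hugging Face transformers port rather than the paper's MMDetection-based pipeline; the checkpoint name, image path, caption, and thresholds are assumptions for illustration.

```python
# Sketch of phrase grounding with a Grounding DINO checkpoint via the
# Hugging Face transformers port (the paper's own open pipeline is built
# on MMDetection). Checkpoint, inputs, and thresholds are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, GroundingDinoForObjectDetection

ckpt = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(ckpt)
model = GroundingDinoForObjectDetection.from_pretrained(ckpt)

image = Image.open("example.jpg")
# Grounding DINO expects lowercase phrases, each terminated by a period.
caption = "a man in a red shirt. a brown dog."

inputs = processor(images=image, text=caption, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],  # (height, width)
)[0]
for label, box, score in zip(results["labels"], results["boxes"], results["scores"]):
    print(label, [round(v, 1) for v in box.tolist()], round(score.item(), 3))
```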
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data.
Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models
While we demonstrate our data augmentation method with the MDETR framework, the proposed approach is applicable to common grounding-based vision and language tasks with other frameworks.
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks
It has been established that training a box-based detector network can enhance the localization performance of weakly supervised and unsupervised methods.
A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions.
A Survey on Interpretable Cross-modal Reasoning
In recent years, cross-modal reasoning (CMR), the process of understanding and reasoning across different modalities, has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.
Kosmos-2: Grounding Multimodal Large Language Models to the World
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.
Trade-offs in Fine-tuned Diffusion Models Between Accuracy and Interpretability
Recent advancements in diffusion models have significantly impacted the trajectory of generative machine learning research, with many adopting the strategy of fine-tuning pre-trained models using domain-specific text-to-image datasets.