7 code implementations • 1 Aug 2024 • Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos.
Ranked #2 on Visual Object Tracking on VOT2022
15 code implementations • CVPR 2023 • Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie
This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation.
Ranked #46 on Semantic Segmentation on ADE20K
no code implementations • ICCV 2023 • Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, Angel X. Chang
Performing 3D dense captioning and visual grounding requires a common and shared understanding of the underlying multimodal relationships.
5 code implementations • CVPR 2023 • Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP.
1 code implementation • 13 Oct 2022 • Ronghang Hu, Shoubhik Debnath, Saining Xie, Xinlei Chen
Masked Autoencoding (MAE) has emerged as an effective approach for pre-training representations across multiple domains.
4 code implementations • CVPR 2022 • Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.
Ranked #4 on Image Retrieval on MS COCO
1 code implementation • ICCV 2021 • Ronghang Hu, Amanpreet Singh
We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning.
1 code implementation • ICCV 2021 • Ronghang Hu, Nikhila Ravi, Alexander C. Berg, Deepak Pathak
We present Worldsheet, a method for novel view synthesis using just a single RGB image as input.
no code implementations • ECCV 2020 • Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh
Image descriptions can help visually impaired people to quickly understand the image content.
1 code implementation • CVPR 2020 • Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach
Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question.
no code implementations • ACL 2019 • Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, Kate Saenko
The actual grounding can connect language to the environment through multiple modalities, e.g., "stop at the door" might ground into visual objects, while "turn right" might rely only on the geometric structure of a route.
1 code implementation • ICCV 2019 • Ronghang Hu, Anna Rohrbach, Trevor Darrell, Kate Saenko
E.g., conditioning on the "on" relationship to the plate, the object "mug" gathers messages from the object "plate" to update its representation to "mug on the plate", which can be easily consumed by a simple classifier for answer prediction.
Ranked #3 on Referring Expression Comprehension on CLEVR-Ref+
no code implementations • ECCV 2018 • Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata
Our model improves the textual explanation quality of fine-grained classification decisions on the CUB dataset by mentioning phrases that are grounded in the image.
1 code implementation • ECCV 2018 • Ronghang Hu, Jacob Andreas, Trevor Darrell, Kate Saenko
In complex inferential tasks like question answering, machine learning models must confront two challenges: the need to implement a compositional reasoning process, and, in many applications, the need for this reasoning process to be interpretable to assist users in both development and prediction.
Ranked #14 on Referring Expression Comprehension on Talk2Car
no code implementations • 26 Jun 2018 • Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata
We call such textual explanations counterfactual explanations, and propose an intuitive method to generate counterfactual explanations by inspecting which evidence in an input is missing but might contribute to a different classification decision if present in the image.
1 code implementation • NeurIPS 2018 • Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, Trevor Darrell
We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction.
3 code implementations • CVPR 2018 • Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, Ross Girshick
Most methods for object instance segmentation require all training examples to be labeled with segmentation masks.
no code implementations • 17 Nov 2017 • Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata
Existing models which generate textual explanations enforce task relevance through a discriminative term loss function, but such mechanisms only weakly constrain mentioned object parts to actually be present in the image.
1 code implementation • ICCV 2017 • Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko
Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems.
Ranked #44 on Visual Question Answering (VQA) on VQA v2 test-dev
2 code implementations • CVPR 2017 • Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, Kate Saenko
In this paper, we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression, and grounding them all in the scene.
Ranked #1 on Visual Question Answering (VQA) on Visual7W
no code implementations • 30 Aug 2016 • Ronghang Hu, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell
Image segmentation from referring expressions is a joint vision and language modeling task, where the input is an image and a textual expression describing a particular region in the image, and the goal is to localize and segment the specific image region based on the given expression.
4 code implementations • 20 Mar 2016 • Ronghang Hu, Marcus Rohrbach, Trevor Darrell
To produce pixelwise segmentation for the language expression, we propose an end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information.
Ranked #16 on Referring Expression Segmentation on J-HMDB
1 code implementation • CVPR 2016 • Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, Trevor Darrell
In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object.
Ranked #12 on Referring Expression Comprehension on Talk2Car
3 code implementations • 12 Nov 2015 • Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele
We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.
Ranked #13 on Phrase Grounding on Flickr30k Entities Test
no code implementations • ICCV 2015 • Damian Mrowca, Marcus Rohrbach, Judy Hoffman, Ronghang Hu, Kate Saenko, Trevor Darrell
Our approach proves to be especially useful in large-scale settings with thousands of classes, where spatial and semantic interactions are very frequent and only weakly supervised detectors can be built due to a lack of bounding box annotations.
1 code implementation • NeurIPS 2014 • Judy Hoffman, Sergio Guadarrama, Eric Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, Kate Saenko
A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories.