Search Results for author: Ronghang Hu

Found 25 papers, 17 papers with code

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

10 code implementations CVPR 2023 Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie

This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation.

Object Detection · Representation Learning +2

Scaling Language-Image Pre-training via Masking

4 code implementations CVPR 2023 Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He

We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP.
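
FLIP's efficiency gain comes from randomly dropping a large fraction of image patches so the image encoder only processes the rest. A minimal sketch of that masking step, with hypothetical names not taken from the paper's code:

```python
import random

def mask_patches(patches, mask_ratio=0.5, rng=random):
    """Randomly drop a fraction of image patches (FLIP-style masking).

    Returns the kept patches and their original indices, so the encoder
    only processes (1 - mask_ratio) of the tokens per image.
    """
    n_keep = int(len(patches) * (1 - mask_ratio))
    keep_idx = sorted(rng.sample(range(len(patches)), n_keep))
    return [patches[i] for i in keep_idx], keep_idx

patches = [f"patch_{i}" for i in range(16)]  # a 4x4 grid of patch tokens
kept, idx = mask_patches(patches, mask_ratio=0.5)
# With a 50% mask ratio, only 8 of 16 patches reach the image encoder,
# roughly halving the per-image encoding cost.
```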

Exploring Long-Sequence Masked Autoencoders

1 code implementation 13 Oct 2022 Ronghang Hu, Shoubhik Debnath, Saining Xie, Xinlei Chen

Masked Autoencoding (MAE) has emerged as an effective approach for pre-training representations across multiple domains.
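
The MAE recipe is: hide a high fraction of patches, encode only the visible ones, and train a decoder to reconstruct the hidden ones, scoring the loss only on masked positions. A toy sketch of that loop, with a trivial stand-in predictor instead of a real encoder/decoder (all names hypothetical):

```python
import random

def mae_step(patches, mask_ratio=0.75, rng=random):
    """One conceptual MAE step: split patches into visible / masked,
    'reconstruct' the masked ones, and score only the masked positions."""
    n = len(patches)
    masked_idx = set(rng.sample(range(n), int(n * mask_ratio)))
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    # Stand-in for encoder + decoder: predict every masked patch as the
    # mean of the visible patches (a real model predicts pixel values).
    pred = sum(visible) / len(visible)
    loss = sum((patches[i] - pred) ** 2 for i in masked_idx) / len(masked_idx)
    return loss, sorted(masked_idx)

patches = [float(i) for i in range(16)]  # toy 1-D "pixel" per patch
loss, masked = mae_step(patches)
```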

Object Detection · Segmentation +1

FLAVA: A Foundational Language And Vision Alignment Model

3 code implementations CVPR 2022 Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.

Image Retrieval · Image-to-Text Retrieval +3

UniT: Multimodal Multitask Learning with a Unified Transformer

1 code implementation ICCV 2021 Ronghang Hu, Amanpreet Singh

We propose UniT, a Unified Transformer model to simultaneously learn the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning.

Multimodal Reasoning · Multi-Task Learning +4

Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation

no code implementations ACL 2019 Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, Kate Saenko

The actual grounding can connect language to the environment through multiple modalities, e.g., "stop at the door" might ground into visual objects, while "turn right" might rely only on the geometric structure of a route.

Vision and Language Navigation

Language-Conditioned Graph Networks for Relational Reasoning

1 code implementation ICCV 2019 Ronghang Hu, Anna Rohrbach, Trevor Darrell, Kate Saenko

For example, conditioned on the "on" relationship to the plate, the object "mug" gathers messages from the object "plate" and updates its representation to "mug on the plate", which a simple classifier can then consume for answer prediction.
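
The mechanism in that example is one round of language-conditioned message passing: each object node sums features from other nodes, weighted by a score that depends on the question. A minimal sketch with scalar features and a hand-set "on" relation weight (all names hypothetical):

```python
def lcgn_round(node_feats, edge_weight, question_words):
    """One message-passing round: each object updates its feature with a
    sum of neighbor features, weighted by a language-conditioned score."""
    updated = {}
    for obj in node_feats:
        msg = sum(edge_weight(question_words, obj, other) * node_feats[other]
                  for other in node_feats if other != obj)
        updated[obj] = node_feats[obj] + msg
    return updated

# Toy scene: scalar features, and a weight that fires on the "on" relation.
feats = {"mug": 1.0, "plate": 2.0, "fork": 3.0}
def on_relation(words, dst, src):
    return 1.0 if ("on" in words and dst == "mug" and src == "plate") else 0.0

out = lcgn_round(feats, on_relation, ["mug", "on", "plate"])
# "mug" gathers the message from "plate": 1.0 + 1.0 * 2.0 = 3.0
```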

Object · Referring Expression Comprehension +2

Grounding Visual Explanations

no code implementations ECCV 2018 Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata

Our model improves the textual explanation quality of fine-grained classification decisions on the CUB dataset by mentioning phrases that are grounded in the image.

General Classification · Sentence

Explainable Neural Computation via Stack Neural Module Networks

1 code implementation ECCV 2018 Ronghang Hu, Jacob Andreas, Trevor Darrell, Kate Saenko

In complex inferential tasks like question answering, machine learning models must confront two challenges: the need to implement a compositional reasoning process, and, in many applications, the need for this reasoning process to be interpretable to assist users in both development and prediction.

Decision Making · Question Answering +1

Generating Counterfactual Explanations with Natural Language

no code implementations 26 Jun 2018 Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata

We call such textual explanations counterfactual explanations, and propose an intuitive method to generate counterfactual explanations by inspecting which evidence in an input is missing, but might contribute to a different classification decision if present in the image.

Classification · counterfactual +2

Speaker-Follower Models for Vision-and-Language Navigation

1 code implementation NeurIPS 2018 Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, Trevor Darrell

We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction.
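The pragmatic-reasoning step amounts to reranking candidate routes by combining two scores: how likely the follower finds the route given the instruction, and how well the speaker thinks the route explains the instruction. A toy sketch with hand-set scores (names and the linear mixing rule are illustrative assumptions):

```python
def pragmatic_rerank(candidates, follower_score, speaker_score, weight=0.5):
    """Rank candidate action sequences by mixing the follower's score
    (how likely the route is given the instruction) with the speaker's
    score (how well the route explains the instruction)."""
    def combined(route):
        return (1 - weight) * follower_score(route) + weight * speaker_score(route)
    return max(candidates, key=combined)

# Toy scores for three candidate routes: "B" is good under both models.
follower = {"A": 0.9, "B": 0.8, "C": 0.1}
speaker  = {"A": 0.2, "B": 0.9, "C": 0.3}
best = pragmatic_rerank(["A", "B", "C"], follower.get, speaker.get)
# combined scores: A = 0.55, B = 0.85, C = 0.20 -> picks "B"
```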

Data Augmentation · Vision and Language Navigation

Learning to Segment Every Thing

3 code implementations CVPR 2018 Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, Ross Girshick

Most methods for object instance segmentation require all training examples to be labeled with segmentation masks.

Instance Segmentation · Segmentation +1

Grounding Visual Explanations (Extended Abstract)

no code implementations 17 Nov 2017 Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, Zeynep Akata

Existing models which generate textual explanations enforce task relevance through a discriminative term loss function, but such mechanisms only weakly constrain mentioned object parts to actually be present in the image.

Attribute · Test

Modeling Relationships in Referential Expressions with Compositional Modular Networks

2 code implementations CVPR 2017 Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, Kate Saenko

In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene.
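
The modular decomposition can be sketched as three scoring modules — subject, relationship, object — whose scores are summed over every candidate box pair, with the best pair taken as the grounding. A toy sketch with hand-set scores (all names hypothetical, not the paper's implementation):

```python
def ground_relational_expression(subj_score, rel_score, obj_score, boxes):
    """Score every (subject, object) box pair as the sum of three module
    scores and return the best pair (a CMN-style factorization, sketched)."""
    best, best_score = None, float("-inf")
    for s in boxes:
        for o in boxes:
            if s == o:
                continue
            score = subj_score(s) + rel_score(s, o) + obj_score(o)
            if score > best_score:
                best, best_score = (s, o), score
    return best

# Toy example for "mug on the table": boxes with hand-set module scores.
boxes = ["mug_box", "table_box", "chair_box"]
subj = lambda b: 1.0 if b == "mug_box" else 0.0
obj  = lambda b: 1.0 if b == "table_box" else 0.0
rel  = lambda s, o: 0.5 if (s, o) == ("mug_box", "table_box") else 0.0
pair = ground_relational_expression(subj, rel, obj, boxes)
# the pair ("mug_box", "table_box") scores 2.5, higher than any other
```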

Visual Question Answering (VQA)

Utilizing Large Scale Vision and Text Datasets for Image Segmentation from Referring Expressions

no code implementations 30 Aug 2016 Ronghang Hu, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell

Image segmentation from referring expressions is a joint vision and language modeling task: the input is an image and a textual expression describing a particular region of the image, and the goal is to localize and segment that region based on the given expression.

Image Captioning · Image Segmentation +3

Segmentation from Natural Language Expressions

3 code implementations 20 Mar 2016 Ronghang Hu, Marcus Rohrbach, Trevor Darrell

To produce pixelwise segmentation for the language expression, we propose an end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information.
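
A common way such joint models fuse the two streams — used here only as an illustrative assumption about the mechanism — is to tile the sentence embedding over every spatial position of the convolutional feature map and concatenate, so a per-pixel classifier sees both visual and linguistic features:

```python
def fuse_language_with_conv(conv_map, sentence_embedding):
    """Tile a sentence embedding over every spatial position of a conv
    feature map and concatenate, giving each pixel location a joint
    [visual..., language...] feature for per-pixel segmentation scoring."""
    return [[cell + sentence_embedding for cell in row] for row in conv_map]

# Toy 2x2 feature map with 2-dim visual features, 1-dim sentence embedding.
fmap = [[[0.1, 0.2], [0.3, 0.4]],
        [[0.5, 0.6], [0.7, 0.8]]]
fused = fuse_language_with_conv(fmap, [9.0])
# every spatial cell now carries the language embedding alongside its
# visual features, e.g. fused[0][0] == [0.1, 0.2, 9.0]
```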

Referring Expression Segmentation · Segmentation +1

Natural Language Object Retrieval

1 code implementation CVPR 2016 Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, Trevor Darrell

In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object.

Image Captioning · Image Retrieval +4

Grounding of Textual Phrases in Images by Reconstruction

3 code implementations 12 Nov 2015 Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.
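
The attention step behind this idea can be sketched as a softmax over candidate region scores; the attended (weighted) region features would then feed a decoder that tries to reconstruct the phrase, and the attention that best supports reconstruction is taken as the grounding. A minimal sketch of the softmax attention alone (names hypothetical):

```python
import math

def attend(phrase_score, regions):
    """Soft attention over candidate regions: softmax-normalize each
    region's compatibility score with the phrase into weights."""
    scores = [phrase_score(r) for r in regions]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: the phrase is most compatible with the "dog" box.
regions = ["box_dog", "box_cat", "box_tree"]
weights = attend(lambda r: 2.0 if r == "box_dog" else 0.0, regions)
# the weighted sum of region features under these weights is what a
# reconstruction decoder would condition on during training
```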

Language Modelling · Natural Language Visual Grounding +3

Spatial Semantic Regularisation for Large Scale Object Detection

no code implementations ICCV 2015 Damian Mrowca, Marcus Rohrbach, Judy Hoffman, Ronghang Hu, Kate Saenko, Trevor Darrell

Our approach proves to be especially useful in large scale settings with thousands of classes, where spatial and semantic interactions are very frequent and only weakly supervised detectors can be built due to a lack of bounding box annotations.

Clustering · Object +2
