Search Results for author: Harsh Agrawal

Found 18 papers, 7 papers with code

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

no code implementations 24 Oct 2024 Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan

Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitations.

Diversity, Language Modelling, +3

Grounding Multimodal Large Language Models in Actions

no code implementations 12 Jun 2024 Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Zsolt Kira, Alexander Toshev

For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance.

World Knowledge
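The contrast drawn in that result is easy to picture in code. Below is a minimal sketch, not the authors' implementation, of the two ways to expose discrete actions to an MLLM: reusing ordinary language tokens (semantic alignment) versus minting fresh action tokens whose embeddings must be learned from scratch. The action names and toy vocabulary are hypothetical stand-ins.

```python
# Minimal sketch, not the paper's code: two ways to encode discrete actions.
VOCAB = {"move": 0, "forward": 1, "turn": 2, "left": 3, "right": 4, "stop": 5}
ACTIONS = ["move forward", "turn left", "turn right", "stop"]

def encode_semantic(action: str) -> list[int]:
    """Semantic alignment: reuse ordinary language tokens, so the model's
    pretrained embeddings for these words carry over to action prediction."""
    return [VOCAB[w] for w in action.split()]

def encode_special(action: str) -> list[int]:
    """Baseline this is contrasted with: one fresh token id per action,
    appended past the vocabulary and trained from scratch."""
    return [len(VOCAB) + ACTIONS.index(action)]

def decode_semantic(token_ids: list[int]) -> str | None:
    """Map generated language tokens back to an action string, if any."""
    inv = {v: k for k, v in VOCAB.items()}
    text = " ".join(inv[t] for t in token_ids if t in inv)
    return text if text in ACTIONS else None

if __name__ == "__main__":
    for a in ACTIONS:
        print(a, encode_semantic(a), encode_special(a))
```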

Housekeep: Tidying Virtual Households using Commonsense Reasoning

1 code implementation 22 May 2022 Yash Kant, Arun Ramachandran, Sriram Yenamandra, Igor Gilitschenski, Dhruv Batra, Andrew Szot, Harsh Agrawal

Instead, the agent must learn from, and is evaluated against, human preferences about which objects belong where in a tidy house.

Language Modelling, Large Language Model

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

no code implementations NeurIPS 2021 Abhinav Moudgil, Arjun Majumdar, Harsh Agrawal, Stefan Lee, Dhruv Batra

Natural language instructions for visual navigation often use scene descriptions (e.g., "bedroom") and object references (e.g., "green chairs") to provide a breadcrumb trail to a goal location.

Object, Scene Classification, +2

Contrast and Classify: Training Robust VQA Models

1 code implementation ICCV 2021 Yash Kant, Abhinav Moudgil, Dhruv Batra, Devi Parikh, Harsh Agrawal

Recent Visual Question Answering (VQA) models have shown impressive performance on the VQA benchmark but remain sensitive to small linguistic variations in input questions.

Contrastive Learning, Data Augmentation, +4
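As the title and task tags suggest, the paper pairs the usual answer classifier with contrastive learning over question variations. A minimal sketch of that general recipe, assuming a hypothetical fused question-image embedding and rephrasing pairs; this is not the authors' exact objective:

```python
import torch
import torch.nn.functional as F

def nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over a batch: anchor[i] (a question's fused embedding)
    should be closest to positive[i] (its rephrasing), with all other
    positives in the batch serving as negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(a.size(0))       # the matching index is positive
    return F.cross_entropy(logits, targets)

def total_loss(answer_logits, answers, anchor, positive,
               alpha: float = 0.5) -> torch.Tensor:
    """Joint objective: answer classification plus the contrastive term;
    the weighting alpha is a hypothetical choice."""
    return F.cross_entropy(answer_logits, answers) + alpha * nce_loss(anchor, positive)
```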

Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning

no code implementations ICCV 2019 Jyoti Aneja, Harsh Agrawal, Dhruv Batra, Alexander Schwing

We encourage this temporal latent space to capture the 'intention' about how to complete the sentence by mimicking a representation which summarizes the future.

Diversity, Image Captioning, +2
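A rough sketch of the mimicking objective described above, under stated assumptions: the "representation which summarizes the future" is taken here to be a mean over future word embeddings, an illustrative simplification rather than necessarily the paper's choice.

```python
import torch
import torch.nn.functional as F

def future_summary(word_embs: torch.Tensor, t: int) -> torch.Tensor:
    """One simple 'summary of the future': the mean embedding of the
    words that come after position t in the caption."""
    return word_embs[t + 1:].mean(dim=0)

def intention_loss(latents: torch.Tensor, word_embs: torch.Tensor) -> torch.Tensor:
    """Encourage the per-step latent z_t to match a summary of the
    not-yet-generated words, so it carries the 'intention' about how
    the sentence will be completed. Shapes: (T, D) for both inputs."""
    T = latents.size(0)
    targets = torch.stack([future_summary(word_embs, t) for t in range(T - 1)])
    return F.mse_loss(latents[: T - 1], targets)
```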

EvalAI: Towards Better Evaluation Systems for AI Agents

3 code implementations 10 Feb 2019 Deshraj Yadav, Rishabh Jain, Harsh Agrawal, Prithvijit Chattopadhyay, Taranjeet Singh, Akash Jain, Shiv Baran Singh, Stefan Lee, Dhruv Batra

We introduce EvalAI, an open source platform for evaluating and comparing machine learning (ML) and artificial intelligence (AI) algorithms at scale.

Benchmarking, BIG-bench Machine Learning
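Platforms like EvalAI pair hosted leaderboards with challenge-specific evaluation code that scores uploaded predictions against held-out annotations. A toy sketch of what the core of such an evaluation routine can look like; the JSON schema and function name here are hypothetical illustrations, not EvalAI's actual interface:

```python
import json

def evaluate_submission(annotation_file: str, submission_file: str) -> dict:
    """Score a submission against ground truth. The {example_id: label}
    JSON format is an assumed toy schema, not EvalAI's format."""
    with open(annotation_file) as f:
        truth = json.load(f)
    with open(submission_file) as f:
        preds = json.load(f)
    correct = sum(preds.get(k) == v for k, v in truth.items())
    return {"accuracy": correct / max(len(truth), 1)}
```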

nocaps: novel object captioning at scale

2 code implementations ICCV 2019 Harsh Agrawal, Karan Desai, YuFei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, Peter Anderson

To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task.

Image Captioning, Object, +2

Fabrik: An Online Collaborative Neural Network Editor

no code implementations 27 Oct 2018 Utsav Garg, Viraj Prabhu, Deshraj Yadav, Ram Ramrakhya, Harsh Agrawal, Dhruv Batra

We present Fabrik, an online neural network editor that provides tools to visualize, edit, and share neural networks from within a browser.

Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

no code implementations 17 Jun 2016 (also listed as EMNLP 2016) Abhishek Das, Harsh Agrawal, C. Lawrence Zitnick, Devi Parikh, Dhruv Batra

We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images.

Question Answering, Visual Question Answering
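One natural way to quantify "do humans and networks look at the same regions" is to correlate the two attention maps. The sketch below uses Spearman rank correlation over flattened maps; the map resolution and preprocessing are assumptions for illustration, not necessarily the paper's exact protocol.

```python
import numpy as np
from scipy.stats import spearmanr

def attention_rank_correlation(human_map: np.ndarray,
                               model_map: np.ndarray) -> float:
    """Spearman rank correlation between two spatial attention maps,
    assumed already resized to the same H x W grid."""
    rho, _ = spearmanr(human_map.ravel(), model_map.ravel())
    return float(rho)

# Example with random placeholder maps (a 14x14 grid is an assumption):
h = np.random.rand(14, 14)
m = np.random.rand(14, 14)
print(attention_rank_correlation(h, m))
```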

Object-Proposal Evaluation Protocol is 'Gameable'

1 code implementation CVPR 2016 Neelima Chavali, Harsh Agrawal, Aroma Mahendru, Dhruv Batra

Finally, we plan to release an easy-to-use toolbox that combines various publicly available implementations of object proposal algorithms and standardizes proposal generation and evaluation, so that new methods can be added and evaluated on different datasets.

Object, object-detection, +2
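For context, the protocol being called gameable is the standard recall-based one: a ground-truth box counts as found if some proposal overlaps it above an IoU threshold. A minimal sketch, where the (x1, y1, x2, y2) box format and the 0.5 threshold are conventional assumptions:

```python
import numpy as np

def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(gt_boxes, proposals, thresh: float = 0.5) -> float:
    """Fraction of ground-truth boxes matched by at least one proposal.
    This is the metric the paper argues can be gamed: category-agnostic
    proposals tuned to the annotated classes inflate it."""
    hits = sum(any(iou(g, p) >= thresh for p in proposals) for g in gt_boxes)
    return hits / max(len(gt_boxes), 1)
```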
