Search Results for author: Jihyung Kil

Found 7 papers, 4 papers with code

II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering

no code implementations · 16 Feb 2024 · Jihyung Kil, Farideh Tavazoee, Dongyeop Kang, Joo-Kyung Kim

II-MMR then analyzes this path to identify different reasoning cases in current VQA benchmarks by estimating how many hops and what types (i.e., visual or beyond-visual) of reasoning are required to answer the question.

Question Answering · Visual Question Answering

Dual-View Visual Contextualization for Web Navigation

no code implementations · 6 Feb 2024 · Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao

Automatic web navigation aims to build a web agent that can follow language instructions to execute complex and diverse tasks on real-world websites.

GPT-4V(ision) is a Generalist Web Agent, if Grounded

1 code implementation · 3 Jan 2024 · Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su

The recent development of large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering.

Image Captioning · Question Answering · +1

PreSTU: Pre-Training for Scene-Text Understanding

no code implementations · ICCV 2023 · Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, Radu Soricut

The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often failed to include such an ability in their training objective.

Image Captioning · Optical Character Recognition (OCR) · +2

One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones

1 code implementation · CVPR 2022 · Chan Hee Song, Jihyung Kil, Tai-Yu Pan, Brian M. Sadler, Wei-Lun Chao, Yu Su

We study the problem of developing autonomous agents that can follow human instructions to infer and perform a sequence of actions to complete the underlying task.

Vision and Language Navigation
