1 code implementation • 11 Dec 2023 • Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh
In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs, enabling profound visual understanding while harnessing the LLMs' robust capabilities.
Ranked #1 on Science Question Answering on ScienceQA (using extra training data)
1 code implementation • 24 Oct 2023 • Dohwan Ko, Ji Soo Lee, Wooyoung Kang, Byungseok Roh, Hyunwoo J. Kim
We observe that the LLMs provide effective priors in exploiting $\textit{linguistic shortcuts}$ for temporal and causal reasoning in Video Question Answering (VideoQA).
Ranked #1 on Video Question Answering on TVQA
no code implementations • 5 Sep 2023 • TaeHoon Kim, Pyunghwan Ahn, Sangyun Kim, Sihaeng Lee, Mark Marsden, Alessandra Sala, Seung Hwan Kim, Bohyung Han, Kyoung Mu Lee, Honglak Lee, Kyounghoon Bae, Xiangyu Wu, Yi Gao, Hailiang Zhang, Yang Yang, Weili Guo, Jianfeng Lu, Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, In So Kweon, Junmo Kim, Wooyoung Kang, Won Young Jhoo, Byungseok Roh, Jonghwan Mun, Solgil Oh, Kenan Emir Ak, Gwang-Gook Lee, Yan Xu, Mingwei Shen, Kyomin Hwang, Wonsik Shin, Kamin Lee, Wonhark Park, Dongkwan Lee, Nojun Kwak, Yujin Wang, Yimu Wang, Tiancheng Gu, Xingchang Lv, Mingmao Sun
In this report, we introduce the NICE (New frontiers for zero-shot Image Captioning Evaluation) project and share the results and outcomes of the 2023 challenge.
no code implementations • 23 Mar 2023 • Han-Cheol Cho, Won Young Jhoo, Wooyoung Kang, Byungseok Roh
Recent open-vocabulary detection methods aim to detect novel objects by distilling knowledge from vision-language models (VLMs) trained on a vast amount of image-text pairs.
1 code implementation • ICCV 2023 • Wooyoung Kang, Jonghwan Mun, Sungjun Lee, Byungseok Roh
Image captioning is one of the straightforward tasks that can take advantage of large-scale web-crawled data, which provides rich knowledge about the visual world for a captioning model.
no code implementations • 19 Oct 2022 • Jihyeon Lee, Wooyoung Kang, Eun-Sol Kim
It is well known that most of the conventional video question answering (VideoQA) datasets consist of easy questions requiring simple reasoning processes.