1 code implementation • 27 Mar 2024 • Yiwu Zhong, Zi-Yuan Hu, Michael R. Lyu, Liwei Wang
When visual tables serve as standalone visual representations, our model can closely match or even surpass the SOTA MLLMs built on CLIP visual embeddings.
Ranked #37 on Visual Question Answering on MM-Vet
2 code implementations • 4 Dec 2023 • Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, Liwei Wang
We conduct extensive experiments to evaluate the performance and generalizability of our model.
3D Question Answering (3D-QA), Embodied Question Answering, +3 more
2 code implementations • 13 Nov 2023 • An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang
We first benchmark MM-Navigator on our collected iOS screen dataset.
no code implementations • 4 Oct 2023 • An Yan, Yu Wang, Yiwu Zhong, Zexue He, Petros Karypis, Zihan Wang, Chengyu Dong, Amilcare Gentili, Chun-Nan Hsu, Jingbo Shang, Julian McAuley
Medical image classification is a critical problem for healthcare, with the potential to alleviate the workload of doctors and facilitate patient diagnosis.
1 code implementation • ICCV 2023 • An Yan, Yu Wang, Yiwu Zhong, Chengyu Dong, Zexue He, Yujie Lu, William Wang, Jingbo Shang, Julian McAuley
Recent advances in foundation models present new opportunities for interpretable visual recognition -- one can first query Large Language Models (LLMs) to obtain a set of attributes that describe each class, then apply vision-language models to classify images via these attributes.
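This attribute-based pipeline can be made concrete with a minimal sketch. The attribute lists below are illustrative examples of what an LLM might return, not the paper's actual prompts or outputs, and `encode_text` / `encode_image` are random stand-ins for a CLIP-style encoder:

```python
import torch
import torch.nn.functional as F

# Attributes per class, as an LLM might return them (illustrative, not from the paper).
class_attributes = {
    "zebra": ["black and white stripes", "four legs", "horse-like body"],
    "tiger": ["orange fur with black stripes", "whiskers", "large paws"],
}

def encode_text(texts):
    # Stand-in for a CLIP-style text encoder; returns unit-norm embeddings.
    return F.normalize(torch.randn(len(texts), 512), dim=-1)

def encode_image(image):
    # Stand-in for a CLIP-style image encoder.
    return F.normalize(torch.randn(1, 512), dim=-1)

def classify(image):
    img = encode_image(image)                 # (1, 512)
    scores = {}
    for cls, attrs in class_attributes.items():
        attr_emb = encode_text(attrs)         # (num_attrs, 512)
        sims = (img @ attr_emb.T).squeeze(0)  # image-attribute similarities
        scores[cls] = sims.mean().item()      # aggregate over attributes
    return max(scores, key=scores.get)
```

The per-attribute similarities, not just the final class score, are what make the prediction interpretable: they indicate which attributes the image actually matched.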
1 code implementation • CVPR 2023 • Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, Yin Li
In this work, we propose to learn a video representation that encodes both action steps and their temporal ordering, using a large-scale dataset of web instructional videos and their narrations, without human annotations.
1 code implementation • CVPR 2022 • Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao
However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans.
Ranked #11 on Open Vocabulary Object Detection on MSCOCO (using extra training data)
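The region-level use of CLIP that exposes this domain shift can be sketched as follows: each proposal is cropped and encoded, then scored against embedded concept names. The encoders here are random placeholders, not the paper's models; RegionCLIP itself closes the gap by pretraining on pseudo region-text pairs rather than scoring raw crops this way.

```python
import torch
import torch.nn.functional as F

def encode_text(texts):
    # Stand-in for CLIP's text encoder over concept names.
    return F.normalize(torch.randn(len(texts), 512), dim=-1)

def encode_region(image, box):
    # Stand-in for cropping a proposal and encoding it with CLIP's image
    # encoder; a real pipeline would encode image[:, y0:y1, x0:x1].
    return F.normalize(torch.randn(512), dim=-1)

# Score every region proposal against every concept name.
concepts = encode_text(["person", "dog", "frisbee"])            # (3, 512)
boxes = [(0, 0, 64, 64), (32, 32, 128, 128)]
regions = torch.stack([encode_region(None, b) for b in boxes])  # (2, 512)
logits = regions @ concepts.T                                   # (2, 3) region-concept scores
```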
2 code implementations • CVPR 2022 • Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao
The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich.
Ranked #1 on 2D Object Detection on RF100
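A minimal sketch of the grounding reformulation that GLIP builds on: class names are concatenated into a single text prompt, and the detector's classification logits become region-token alignment scores, so detection and phrase grounding share one head. The feature tensors below are random placeholders standing in for GLIP's actual encoders:

```python
import torch
import torch.nn.functional as F

# Detection as grounding: class names become one text prompt, and the
# classification head scores region-token alignment instead of fixed classes.
classes = ["person", "bicycle", "car"]
prompt = ". ".join(classes) + "."  # "person. bicycle. car."

region_feats = F.normalize(torch.randn(100, 256), dim=-1)          # detector region features (placeholder)
token_feats = F.normalize(torch.randn(len(classes), 256), dim=-1)  # pooled text features per class name (placeholder)

alignment_logits = region_feats @ token_feats.T  # (100, 3): region-class alignment
# Because the same head can score arbitrary grounding phrases, the model can
# train on detection and grounding data jointly and self-train on image-text pairs.
```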
1 code implementation • ICCV 2021 • Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, Yin Li
To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
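The pseudo-label construction can be sketched in a few lines. The detections and caption triplet below are illustrative values, and the paper's actual matching rules (e.g. synonym handling, resolving multiple matches) may differ:

```python
# Detector output (label, box) and a relation triplet parsed from the caption
# "a dog catching a frisbee"; boxes are (x0, y0, x1, y1), illustrative values.
detections = [("dog", (30, 40, 120, 160)), ("frisbee", (140, 50, 180, 90))]
caption_triplets = [("dog", "catching", "frisbee")]

pseudo_labels = []
for subj, rel, obj in caption_triplets:
    subj_boxes = [box for label, box in detections if label == subj]
    obj_boxes = [box for label, box in detections if label == obj]
    # Every matched subject-object region pair receives the caption relation
    # as a "pseudo" label for training the scene graph model.
    for sb in subj_boxes:
        for ob in obj_boxes:
            pseudo_labels.append((sb, rel, ob))
```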
no code implementations • ICCV 2021 • Jing Shi, Yiwu Zhong, Ning Xu, Yin Li, Chenliang Xu
We investigate weakly-supervised scene graph generation, a challenging task since no correspondence between labels and objects is provided.
1 code implementation • ECCV 2020 • Yiwu Zhong, Liwei Wang, Jianshu Chen, Dong Yu, Yin Li
We address the challenging problem of image captioning by revisiting the representation of image scene graphs.