1 code implementation • 27 Mar 2024 • Yiwu Zhong, Zi-Yuan Hu, Michael R. Lyu, Liwei Wang
When visual tables serve as standalone visual representations, our model can closely match or even surpass the SOTA MLLMs built on CLIP visual embeddings.
Ranked #37 on Visual Question Answering on MM-Vet
2 code implementations • 4 Dec 2023 • Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, Liwei Wang
We conduct extensive experiments to evaluate the performance and generalizability of our model.
3D Question Answering (3D-QA), Embodied Question Answering, +3 more
2 code implementations • 13 Nov 2023 • An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang
We first benchmark MM-Navigator on our collected iOS screen dataset.
no code implementations • 4 Oct 2023 • An Yan, Yu Wang, Yiwu Zhong, Zexue He, Petros Karypis, Zihan Wang, Chengyu Dong, Amilcare Gentili, Chun-Nan Hsu, Jingbo Shang, Julian McAuley
Medical image classification is a critical problem for healthcare, with the potential to alleviate the workload of doctors and facilitate patient diagnosis.
1 code implementation • ICCV 2023 • An Yan, Yu Wang, Yiwu Zhong, Chengyu Dong, Zexue He, Yujie Lu, William Wang, Jingbo Shang, Julian McAuley
Recent advances in foundation models present new opportunities for interpretable visual recognition -- one can first query Large Language Models (LLMs) to obtain a set of attributes that describe each class, then apply vision-language models to classify images via these attributes.
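This attribute-based pipeline can be made concrete with a minimal sketch. The attribute lists below are illustrative examples of what an LLM might return, not the paper's actual prompts or outputs, and `encode_text` / `encode_image` are random stand-ins for a CLIP-style encoder:

```python
import torch
import torch.nn.functional as F

# Attributes per class, as an LLM might return them (illustrative, not from the paper).
class_attributes = {
    "zebra": ["black and white stripes", "four legs", "horse-like body"],
    "tiger": ["orange fur with black stripes", "whiskers", "large paws"],
}

def encode_text(texts):
    # Stand-in for a CLIP-style text encoder; returns unit-norm embeddings.
    return F.normalize(torch.randn(len(texts), 512), dim=-1)

def encode_image(image):
    # Stand-in for a CLIP-style image encoder.
    return F.normalize(torch.randn(1, 512), dim=-1)

def classify(image):
    img = encode_image(image)                 # (1, 512)
    scores = {}
    for cls, attrs in class_attributes.items():
        attr_emb = encode_text(attrs)         # (num_attrs, 512)
        sims = (img @ attr_emb.T).squeeze(0)  # image-attribute similarities
        scores[cls] = sims.mean().item()      # aggregate over attributes
    return max(scores, key=scores.get)
```

The per-attribute similarities, not just the final class score, are what make the prediction interpretable: they indicate which attributes the image actually matched.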
1 code implementation • CVPR 2023 • Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, Yin Li
In this work, we propose to learn a video representation that encodes both action steps and their temporal ordering, using a large-scale dataset of web instructional videos and their narrations, without human annotations.
1 code implementation • CVPR 2022 • Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao
However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans.
Ranked #11 on Open Vocabulary Object Detection on MSCOCO (using extra training data)
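The region-level use of CLIP that exposes this domain shift can be sketched as follows: each proposal is cropped and encoded, then scored against embedded concept names. The encoders here are random placeholders, not the paper's models; RegionCLIP itself closes the gap by pretraining on pseudo region-text pairs rather than scoring raw crops this way.

```python
import torch
import torch.nn.functional as F

def encode_text(texts):
    # Stand-in for CLIP's text encoder over concept names.
    return F.normalize(torch.randn(len(texts), 512), dim=-1)

def encode_region(image, box):
    # Stand-in for cropping a proposal and encoding it with CLIP's image
    # encoder; a real pipeline would encode image[:, y0:y1, x0:x1].
    return F.normalize(torch.randn(512), dim=-1)

# Score every region proposal against every concept name.
concepts = encode_text(["person", "dog", "frisbee"])            # (3, 512)
boxes = [(0, 0, 64, 64), (32, 32, 128, 128)]
regions = torch.stack([encode_region(None, b) for b in boxes])  # (2, 512)
logits = regions @ concepts.T                                   # (2, 3) region-concept scores
```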
2 code implementations • CVPR 2022 • Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao
The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich.
Ranked #1 on 2D Object Detection on RF100
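A minimal sketch of the grounding reformulation that GLIP builds on: class names are concatenated into a single text prompt, and the detector's classification logits become region-token alignment scores, so detection and phrase grounding share one head. The feature tensors below are random placeholders standing in for GLIP's actual encoders:

```python
import torch
import torch.nn.functional as F

# Detection as grounding: class names become one text prompt, and the
# classification head scores region-token alignment instead of fixed classes.
classes = ["person", "bicycle", "car"]
prompt = ". ".join(classes) + "."  # "person. bicycle. car."

region_feats = F.normalize(torch.randn(100, 256), dim=-1)          # detector region features (placeholder)
token_feats = F.normalize(torch.randn(len(classes), 256), dim=-1)  # pooled text features per class name (placeholder)

alignment_logits = region_feats @ token_feats.T  # (100, 3): region-class alignment
# Because the same head can score arbitrary grounding phrases, the model can
# train on detection and grounding data jointly and self-train on image-text pairs.
```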
1 code implementation • ICCV 2021 • Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, Yin Li
To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
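The pseudo-label construction can be sketched in a few lines. The detections and caption triplet below are illustrative values, and the paper's actual matching rules (e.g. synonym handling, resolving multiple matches) may differ:

```python
# Detector output (label, box) and a relation triplet parsed from the caption
# "a dog catching a frisbee"; boxes are (x0, y0, x1, y1), illustrative values.
detections = [("dog", (30, 40, 120, 160)), ("frisbee", (140, 50, 180, 90))]
caption_triplets = [("dog", "catching", "frisbee")]

pseudo_labels = []
for subj, rel, obj in caption_triplets:
    subj_boxes = [box for label, box in detections if label == subj]
    obj_boxes = [box for label, box in detections if label == obj]
    # Every matched subject-object region pair receives the caption relation
    # as a "pseudo" label for training the scene graph model.
    for sb in subj_boxes:
        for ob in obj_boxes:
            pseudo_labels.append((sb, rel, ob))
```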
no code implementations • ICCV 2021 • Jing Shi, Yiwu Zhong, Ning Xu, Yin Li, Chenliang Xu
We investigate weakly-supervised scene graph generation, a challenging task since no correspondence between labels and objects is provided.
1 code implementation • ECCV 2020 • Yiwu Zhong, Liwei Wang, Jianshu Chen, Dong Yu, Yin Li
We address the challenging problem of image captioning by revisiting the representation of image scene graphs.