Search Results for author: Zhecan Wang

Found 18 papers, 7 papers with code

Bridging the Gap between Recognition-level Pre-training and Commonsensical Vision-language Tasks

no code implementations · CSRR (ACL) 2022 · Yue Wan, Yueen Ma, Haoxuan You, Zhecan Wang, Shih-Fu Chang

Large-scale visual-linguistic pre-training aims to capture the generic representations from multimodal features, which are essential for downstream vision-language tasks.

Diversity · Informativeness +2

JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images

1 code implementation · 19 Sep 2024 · Zhecan Wang, Junzhang Liu, Chia-Wei Tang, Hani AlOmari, Anushka Sivakumar, Rui Sun, Wenhao Li, Md. Atabuzzaman, Hammad Ayyubi, Haoxuan You, Alvi Ishmam, Kai-Wei Chang, Shih-Fu Chang, Chris Thomas

In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess the model's fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors.

Hallucination · Image Captioning +3

HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

1 code implementation · 22 Jul 2024 · Zhecan Wang, Garrett Bingham, Adams Yu, Quoc Le, Thang Luong, Golnaz Ghiasi

Our results show that benchmarking with generated images is highly correlated (r=0.97) with benchmarking on real images.

Benchmarking · Hallucination +3

Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions

no code implementations · 18 May 2024 · Junzhang Liu, Zhecan Wang, Hammad Ayyubi, Haoxuan You, Chris Thomas, Rui Sun, Shih-Fu Chang, Kai-Wei Chang

Despite the widespread adoption of Vision-Language Understanding (VLU) benchmarks such as VQA v2, OKVQA, A-OKVQA, GQA, VCR, SWAG, and VisualCOMET, our analysis reveals a pervasive issue affecting their integrity: these benchmarks contain samples where answers rely on assumptions unsupported by the provided context.

Visual Question Answering (VQA)

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

1 code implementation · 3 Jul 2023 · Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang

However, we find that fine-grained visual and textual information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding.

Image-text matching · Sentence +2

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

2 code implementations · 24 May 2023 · Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer.
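The excerpt above outlines IdealGPT's iterative decompose-and-reason loop. Below is a minimal sketch of that control flow; the `llm()` and `vlm()` helpers are hypothetical stand-ins for whichever language model and vision-language model are plugged in, and the prompts are illustrative rather than the paper's actual prompts.

```python
# Minimal sketch of the iterative decompose-and-reason loop described above.
# llm() and vlm() are hypothetical stand-ins, not part of the IdealGPT codebase,
# and the prompts are illustrative only.

def llm(prompt: str) -> str:
    """Hypothetical call to a text-only large language model."""
    raise NotImplementedError

def vlm(image, question: str) -> str:
    """Hypothetical call to a vision-language model answering a question about an image."""
    raise NotImplementedError

def answer_with_decomposition(image, question: str, max_rounds: int = 3) -> str:
    """Iteratively decompose a visual question until the reasoner is confident."""
    evidence: list[str] = []
    verdict = "UNSURE"
    for _ in range(max_rounds):
        # 1) An LLM breaks the main question into visual sub-questions.
        sub_questions = llm(
            f"Main question: {question}\nKnown evidence: {evidence}\n"
            "List the visual sub-questions still needed, one per line."
        ).splitlines()

        # 2) A VLM answers each sub-question against the image.
        for sq in sub_questions:
            evidence.append(f"Q: {sq} A: {vlm(image, sq)}")

        # 3) Another LLM call reasons over the collected sub-answers.
        verdict = llm(
            f"Main question: {question}\nEvidence: {evidence}\n"
            "Answer the main question, or reply UNSURE if the evidence is insufficient."
        )
        if "UNSURE" not in verdict:
            break
    return verdict
```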

Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding

no code implementations · 14 Dec 2022 · Haoxuan You, Rui Sun, Zhecan Wang, Kai-Wei Chang, Shih-Fu Chang

We present a new commonsense task, Human-centric Commonsense Grounding, that tests models' ability to ground individuals given context descriptions of what happened before and their mental/physical states or intentions.

Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

no code implementations · 10 Nov 2022 · Zhecan Wang, Haoxuan You, Yicheng He, Wenhao Li, Kai-Wei Chang, Shih-Fu Chang

Visual commonsense understanding requires Vision-Language (VL) models not only to understand the image and text but also to cross-reference between them in order to fully integrate and comprehend the described visual scene.

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

no code implementations · 22 Apr 2022 · Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai, Bin Xiao, Jianwei Yang, Haoxuan You, Kai-Wei Chang, Shih-Fu Chang, Lu Yuan

Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.

Question Answering · Visual Commonsense Reasoning +2

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

no code implementations · 15 Jan 2022 · Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Jianwei Yang, Xiyang Dai, Bin Xiao, Haoxuan You, Shih-Fu Chang, Lu Yuan

Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only.

Question Answering · Visual Commonsense Reasoning +2
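The MAD and CLIP-TD entries above both transfer knowledge from pretrained teachers into a vision-language student. The snippet below is a generic sketch of that pattern, a supervised task loss combined with feature distillation from a frozen teacher, under an assumed student/teacher interface; it is not the papers' actual objective (e.g., the targeted or adaptive weighting is not described in these excerpts).

```python
# Generic sketch of feature distillation from a frozen pretrained teacher (e.g., CLIP)
# into a vision-language student, combined with the downstream task loss. This is the
# common pattern such methods build on, NOT the exact MAD/CLIP-TD objectives; the
# student/teacher interfaces below are assumptions for illustration.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, proj, batch, alpha: float = 0.5) -> torch.Tensor:
    """One training step: supervised task loss + feature-matching distillation loss."""
    images, texts, labels = batch

    # Student forward pass: task logits plus an intermediate representation
    # (assumed interface returning both).
    logits, student_feats = student(images, texts)
    task_loss = F.cross_entropy(logits, labels)

    # Frozen teacher provides target features; no gradients flow into it.
    with torch.no_grad():
        teacher_feats = teacher(images, texts)

    # Match the (projected) student features to the teacher features.
    distill_loss = F.mse_loss(proj(student_feats), teacher_feats)

    return task_loss + alpha * distill_loss
```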

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

no code implementations · 16 Dec 2021 · Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang

As for pre-training, a scene-graph-aware pre-training method is proposed to leverage structural knowledge extracted from the visual scene graph.

Visual Commonsense Reasoning

Graph-MLP: Node Classification without Message Passing in Graph

1 code implementation · 8 Jun 2021 · Yang Hu, Haoxuan You, Zhecan Wang, Zhicheng Wang, Erjin Zhou, Yue Gao

Graph Neural Networks (GNNs) have demonstrated their effectiveness in dealing with non-Euclidean structural data.

Classification · Graph Neural Network +1

Learning Visual Commonsense for Robust Scene Graph Generation

2 code implementations · ECCV 2020 · Alireza Zareian, Zhecan Wang, Haoxuan You, Shih-Fu Chang

Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild.

Graph Generation · Scene Graph Generation +1

Learning to Detect Head Movement in Unconstrained Remote Gaze Estimation in the Wild

no code implementations · 7 Apr 2020 · Zhecan Wang, Jian Zhao, Cheng Lu, Han Huang, Fan Yang, Lianji Li, Yandong Guo

To better demonstrate the advantage of our methods, we further propose a new benchmark dataset with the richest distribution of head-gaze combinations, reflecting real-world scenarios.

Gaze Estimation
