no code implementations • CSRR (ACL) 2022 • Yue Wan, Yueen Ma, Haoxuan You, Zhecan Wang, Shih-Fu Chang
Large-scale visual-linguistic pre-training aims to capture generic representations from multimodal features, which are essential for downstream vision-language tasks.
1 code implementation • 22 Nov 2024 • Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang
Video large language models (VLLMs) have recently made significant advances in processing complex video content, yet their inference efficiency remains constrained by the high computational cost of the thousands of visual tokens generated from video inputs.
no code implementations • 9 Oct 2024 • Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, BoWen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, Yinfei Yang
First, as there is a lack of QA data for egocentric video understanding, we develop a data engine that efficiently generates 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour in length, based on human-annotated data.
no code implementations • 30 Sep 2024 • Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, BoWen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, ZiRui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning.
Ranked #55 on Visual Question Answering on MM-Vet
1 code implementation • 19 Sep 2024 • Zhecan Wang, Junzhang Liu, Chia-Wei Tang, Hani AlOmari, Anushka Sivakumar, Rui Sun, Wenhao Li, Md. Atabuzzaman, Hammad Ayyubi, Haoxuan You, Alvi Ishmam, Kai-Wei Chang, Shih-Fu Chang, Chris Thomas
In this paper, we release JourneyBench, a comprehensive human-annotated benchmark of generated images designed to assess the model's fine-grained multimodal reasoning abilities across five tasks: complementary multimodal chain of thought, multi-image VQA, imaginary image captioning, VQA with hallucination triggers, and fine-grained retrieval with sample-specific distractors.
no code implementations • 18 May 2024 • Junzhang Liu, Zhecan Wang, Hammad Ayyubi, Haoxuan You, Chris Thomas, Rui Sun, Shih-Fu Chang, Kai-Wei Chang
Despite the widespread adoption of Vision-Language Understanding (VLU) benchmarks such as VQA v2, OKVQA, A-OKVQA, GQA, VCR, SWAG, and VisualCOMET, our analysis reveals a pervasive issue affecting their integrity: these benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
1 code implementation • 11 Apr 2024 • Haotian Zhang, Haoxuan You, Philipp Dufter, BoWen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained, fixed visual encoder and fails to perform well on broader tasks.
Ranked #150 on Visual Question Answering on MM-Vet
no code implementations • 16 Mar 2024 • Jingping Nie, Hanya Shao, Yuang Fan, Qijia Shao, Haoxuan You, Matthias Preindl, Xiaofan Jiang
Despite the global mental health crisis, barriers to screenings, professionals, and treatments remain high.
no code implementations • 23 Oct 2023 • Zhecan Wang, Long Chen, Haoxuan You, Keyang Xu, Yicheng He, Wenhao Li, Noel Codella, Kai-Wei Chang, Shih-Fu Chang
Vision-language (VL) understanding tasks evaluate models' comprehension of complex visual scenes through multiple-choice questions.
2 code implementations • 11 Oct 2023 • Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, BoWen Zhang, ZiRui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions.
1 code implementation • 3 Jul 2023 • Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang
However, we find that visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding.
2 code implementations • 24 May 2023 • Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang
Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer.
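A minimal Python sketch of this divide-and-conquer loop (illustrative only; the `llm` and `vlm` callables and the prompts are hypothetical stand-ins, not the released implementation):

```python
# Sketch of an IdealGPT-style loop: an LLM decomposes the question, a VLM
# answers the sub-questions, and a second LLM pass reasons over the evidence.
# llm(prompt: str) -> str and vlm(image, question: str) -> str are supplied by
# the caller (hypothetical interfaces).
def idealgpt_answer(image, main_question: str, llm, vlm, max_rounds: int = 3) -> str:
    evidence = []            # accumulated (sub-question, sub-answer) pairs
    verdict = "UNSURE"
    for _ in range(max_rounds):
        # 1) LLM proposes the visual sub-questions still needed.
        sub_qs = llm(
            "Main question: " + main_question + "\n"
            "Known so far: " + str(evidence) + "\n"
            "List the visual sub-questions still needed, one per line."
        ).splitlines()

        # 2) VLM answers each sub-question from the image.
        for q in filter(None, map(str.strip, sub_qs)):
            evidence.append((q, vlm(image, q)))

        # 3) LLM reasons over the collected evidence toward a final answer.
        verdict = llm(
            "Main question: " + main_question + "\n"
            "Evidence: " + str(evidence) + "\n"
            "Answer the main question, or reply UNSURE if evidence is insufficient."
        )
        if "UNSURE" not in verdict:
            return verdict   # confident answer reached
    return verdict           # best effort after max_rounds
```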
no code implementations • 23 Mar 2023 • Haoxuan You, Mandy Guo, Zhecan Wang, Kai-Wei Chang, Jason Baldridge, Jiahui Yu
The field of vision and language has witnessed a proliferation of pre-trained foundation models.
no code implementations • 14 Dec 2022 • Haoxuan You, Rui Sun, Zhecan Wang, Kai-Wei Chang, Shih-Fu Chang
We present a new commonsense task, Human-centric Commonsense Grounding, that tests a model's ability to ground individuals given context descriptions of what happened before and of their mental/physical states or intentions.
no code implementations • 10 Nov 2022 • Zhecan Wang, Haoxuan You, Yicheng He, Wenhao Li, Kai-Wei Chang, Shih-Fu Chang
Visual commonsense understanding requires Vision-Language (VL) models not only to understand image and text but also to cross-reference between them in order to fully integrate and comprehend the described visual scene.
1 code implementation • 26 Jul 2022 • Haoxuan You, Luowei Zhou, Bin Xiao, Noel Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan
Large-scale multi-modal contrastive pre-training has demonstrated great utility in learning transferable features for a range of downstream tasks by mapping multiple modalities into a shared embedding space.
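For context, the shared-embedding objective typically used in such contrastive pre-training looks like the following PyTorch sketch (a generic CLIP-style symmetric InfoNCE loss, not necessarily this paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings,
    both already projected into the shared embedding space."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; treat rows and columns as classification.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```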
no code implementations • 22 Apr 2022 • Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai, Bin Xiao, Jianwei Yang, Haoxuan You, Kai-Wei Chang, Shih-Fu Chang, Lu Yuan
Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.
Ranked #4 on Visual Question Answering (VQA) on VCR (Q-A) test
1 code implementation • ICLR 2022 • Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, Yun Fu
We observe that detailed local geometric information is probably not the key to point cloud analysis. We therefore introduce a pure residual MLP network, called PointMLP, which integrates no sophisticated local geometric extractors yet still performs very competitively.
Ranked #4 on Point Cloud Segmentation on PointCloud-C
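A minimal PyTorch sketch of the residual, pointwise MLP idea described above (layer sizes are illustrative; PointMLP's geometric affine module and hierarchical grouping are omitted here):

```python
import torch
import torch.nn as nn

class ResidualPointBlock(nn.Module):
    """One residual MLP block applied pointwise to per-point features of
    shape (batch, channels, num_points); no local geometric extractor."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x + self.net(x))   # residual connection

# Example: 8 clouds, 64 feature channels, 1024 points each.
features = torch.randn(8, 64, 1024)
print(ResidualPointBlock(64)(features).shape)   # torch.Size([8, 64, 1024])
```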
no code implementations • 15 Jan 2022 • Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Jianwei Yang, Xiyang Dai, Bin Xiao, Haoxuan You, Shih-Fu Chang, Lu Yuan
Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models pretrained with image-text data only.
no code implementations • 16 Dec 2021 • Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang
For pre-training, a scene-graph-aware method is proposed to leverage the structural knowledge extracted from the visual scene graph.
no code implementations • 29 Sep 2021 • Haoxuan You, Luowei Zhou, Bin Xiao, Noel C Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan
Large-scale multimodal contrastive pretraining has demonstrated great utility in supporting high performance on a range of downstream tasks by mapping multiple modalities into a shared embedding space.
1 code implementation • 8 Jun 2021 • Yang Hu, Haoxuan You, Zhecan Wang, Zhicheng Wang, Erjin Zhou, Yue Gao
Graph Neural Networks (GNNs) have demonstrated their effectiveness in dealing with non-Euclidean structural data.
1 code implementation • NAACL 2021 • Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, Kai-Wei Chang
Pre-trained contextual vision-and-language (V&L) models have achieved impressive performance on various benchmarks.
2 code implementations • ECCV 2020 • Alireza Zareian, Zhecan Wang, Haoxuan You, Shih-Fu Chang
Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild.
2 code implementations • NeurIPS 2019 • Can Qin, Haoxuan You, Lichen Wang, C.-C. Jay Kuo, Yun Fu
Specifically, most general-purpose DA methods, which focus on global feature alignment and ignore local geometric information, are not suitable for 3D domain alignment.
Ranked #1 on Unsupervised Domain Adaptation on PreSIL to KITTI
no code implementations • ICCV 2019 • Peng Gao, Haoxuan You, Zhanpeng Zhang, Xiaogang Wang, Hongsheng Li
The proposed module learns cross-modality relationships between latent visual and language summarizations, which condense the visual regions and the question into a small number of latent representations to avoid modeling uninformative individual region-word relations.
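A rough PyTorch sketch of the latent-summarization idea, using a generic attention-pooling stand-in for the paper's module (class names and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LatentSummarizer(nn.Module):
    """Pools a variable-length set of features into k latent summaries
    via learned attention queries (a generic stand-in, not the exact module)."""
    def __init__(self, dim: int, num_latents: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:       # (B, N, dim)
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        summaries, _ = self.attn(q, feats, feats)                  # (B, k, dim)
        return summaries

# Cross-modality relations are then modeled between the few visual and textual
# summaries instead of between all region-word pairs.
dim = 256
vis_sum = LatentSummarizer(dim)(torch.randn(2, 36, dim))   # 36 regions -> 8 latents
txt_sum = LatentSummarizer(dim)(torch.randn(2, 14, dim))   # 14 words   -> 8 latents
relations = torch.softmax(vis_sum @ txt_sum.transpose(1, 2) / dim ** 0.5, dim=-1)
print(relations.shape)   # torch.Size([2, 8, 8]) latent-to-latent relation map
```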
3 code implementations • 30 Jul 2019 • Min Zhang, Haoxuan You, Pranav Kadam, Shan Liu, C.-C. Jay Kuo
In the attribute-building stage, we address the problem of unordered point cloud data by using a space-partitioning procedure and developing a robust descriptor that characterizes the relationship between a point and its one-hop neighbors in a PointHop unit.
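A small NumPy sketch of the space-partitioning descriptor idea, assuming an eight-octant partition around each point (illustrative, not the reference implementation):

```python
import numpy as np

def octant_descriptor(center: np.ndarray, neighbors: np.ndarray,
                      attrs: np.ndarray) -> np.ndarray:
    """PointHop-style local descriptor sketch: partition a point's one-hop
    neighbors into the 8 octants of the local frame centered at the point,
    average the attributes within each octant, and concatenate the 8 pools.
    center: (3,)  neighbors: (K, 3)  attrs: (K, D)  ->  (8 * D,)"""
    offsets = neighbors - center                                   # local coords
    octant_id = ((offsets > 0) * np.array([1, 2, 4])).sum(axis=1)  # ids 0..7
    d = attrs.shape[1]
    pooled = np.zeros((8, d))
    for o in range(8):
        mask = octant_id == o
        if mask.any():
            pooled[o] = attrs[mask].mean(axis=0)   # empty octants stay zero
    return pooled.reshape(-1)

# Example: 16 neighbors carrying their xyz coordinates as attributes.
nbrs = np.random.randn(16, 3)
print(octant_descriptor(np.zeros(3), nbrs, nbrs).shape)   # (24,)
```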
no code implementations • 13 Dec 2018 • Gao Peng, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven Hoi, Xiaogang Wang, Hongsheng Li
It can robustly capture the high-level interactions between the language and vision domains, thus significantly improving the performance of visual question answering.
no code implementations • 2 Dec 2018 • Haoxuan You, Yifan Feng, Xibin Zhao, Changqing Zou, Rongrong Ji, Yue Gao
More specifically, based on the relation score module, the point-single-view fusion feature is first extracted by fusing the point cloud feature with each single-view feature according to their point-single-view relation; the point-multi-view fusion feature is then extracted by fusing the point cloud feature with the features of varying numbers of views according to their point-multi-view relation.
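A rough PyTorch sketch of relation-score-guided fusion in this spirit (the module name, scoring MLP, and sizes are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class RelationScoreFusion(nn.Module):
    """Score each (point cloud, view) pair with a small MLP, then fuse the view
    features into the point cloud feature as a relation-weighted sum."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                   nn.Linear(dim, 1))

    def forward(self, point_feat: torch.Tensor, view_feats: torch.Tensor):
        # point_feat: (B, dim), view_feats: (B, V, dim)
        p = point_feat.unsqueeze(1).expand(-1, view_feats.size(1), -1)
        rel = torch.softmax(self.score(torch.cat([p, view_feats], -1)), dim=1)
        fused_views = (rel * view_feats).sum(dim=1)   # relation-weighted views
        return point_feat + fused_views               # point-multi-view fusion

# Example: 4 samples, 12 rendered views, 128-dim features.
print(RelationScoreFusion(128)(torch.randn(4, 128), torch.randn(4, 12, 128)).shape)
```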
2 code implementations • 28 Nov 2018 • Yutong Feng, Yifan Feng, Haoxuan You, Xibin Zhao, Yue Gao
However, little effort has been devoted to using mesh data in recent years, due to its complexity and irregularity.
3 code implementations • 25 Sep 2018 • Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, Yue Gao
In this paper, we present a hypergraph neural networks (HGNN) framework for data representation learning, which can encode high-order data correlation in a hypergraph structure.
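The hyperedge convolution at the heart of HGNN can be written compactly as X' = sigma(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta); a toy PyTorch rendering with dense matrices (for clarity only, not the released code):

```python
import torch

def hypergraph_conv(X, H, edge_w, Theta):
    """One hypergraph convolution layer:
    X' = ReLU( Dv^-1/2  H  W  De^-1  H^T  Dv^-1/2  X  Theta )
    X: (N, C) node features, H: (N, E) incidence matrix,
    edge_w: (E,) hyperedge weights, Theta: (C, C_out) learnable weights."""
    W = torch.diag(edge_w)
    Dv = torch.diag(torch.clamp(H @ edge_w, min=1e-6) ** -0.5)   # node degrees
    De = torch.diag(torch.clamp(H.sum(dim=0), min=1e-6) ** -1)   # edge degrees
    return torch.relu(Dv @ H @ W @ De @ H.t() @ Dv @ X @ Theta)

# Toy example: 5 nodes, 3 hyperedges, 4-dim features mapped to 2 dims.
H = (torch.rand(5, 3) > 0.5).float()
out = hypergraph_conv(torch.randn(5, 4), H, torch.ones(3), torch.randn(4, 2))
print(out.shape)   # torch.Size([5, 2])
```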
1 code implementation • 23 Aug 2018 • Haoxuan You, Yifan Feng, Rongrong Ji, Yue Gao
With the recent proliferation of deep learning, various deep models with different representations have achieved state-of-the-art performance.
no code implementations • 28 Nov 2017 • Haoxuan You, Zhicheng Jiao, Haojun Xu, Jie Li, Ying Wang, Xinbo Gao
Generative adversarial networks (GANs) have attracted wide research interest in the field of deep learning.