Search Results for author: Haoxuan You

Found 27 papers, 12 papers with code

Ferret: Refer and Ground Anything Anywhere at Any Granularity

1 code implementation 11 Oct 2023 Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, BoWen Zhang, ZiRui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang

We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring expressions of any shape or granularity within an image and of accurately grounding open-vocabulary descriptions.

Hallucination Language Modelling +1

Hypergraph Neural Networks

2 code implementations 25 Sep 2018 Yifan Feng, Haoxuan You, Zizhao Zhang, Rongrong Ji, Yue Gao

In this paper, we present a hypergraph neural network (HGNN) framework for data representation learning, which can encode high-order data correlations in a hypergraph structure.

Object Recognition Representation Learning
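
The HGNN snippet above hinges on a spectral-style hypergraph convolution. Below is a minimal PyTorch sketch of that operator, assuming a dense node-hyperedge incidence matrix and optional hyperedge weights; the class name and shapes are illustrative, not the authors' released code.

import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    """One hypergraph convolution: X' = Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, incidence, edge_weight=None):
        # x: (N, in_dim) node features; incidence: (N, E) node-hyperedge incidence matrix
        n, e = incidence.shape
        w = torch.ones(e) if edge_weight is None else edge_weight   # hyperedge weights
        dv = (incidence * w).sum(dim=1).clamp(min=1e-6)             # node degrees
        de = incidence.sum(dim=0).clamp(min=1e-6)                   # hyperedge degrees
        # symmetric normalization, then propagate node features through hyperedges
        prop = dv.pow(-0.5).diag() @ incidence @ w.diag() @ de.pow(-1.0).diag() \
               @ incidence.t() @ dv.pow(-0.5).diag()
        return prop @ self.theta(x)

# Toy usage: 5 nodes, 3 hyperedges, 16-dim features projected to 8 dims
x = torch.randn(5, 16)
h = (torch.rand(5, 3) > 0.5).float()
out = HypergraphConv(16, 8)(x, h)   # (5, 8)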

Rethinking Network Design and Local Geometry in Point Cloud: A Simple Residual MLP Framework

1 code implementation ICLR 2022 Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, Yun Fu

We observe that detailed local geometric information is probably not the key to point cloud analysis; we introduce a pure residual MLP network, called PointMLP, which integrates no sophisticated local geometric extractors but still performs very competitively.

3D Point Cloud Classification Point Cloud Segmentation
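
In the spirit of the PointMLP description above, the block below is a purely pointwise residual MLP with no local-geometry extractor. It is an illustrative sketch only; the layer sizes and class name are assumptions, not the published architecture.

import torch
import torch.nn as nn

class ResidualPointBlock(nn.Module):
    """Shared pointwise MLP with a residual connection, applied independently per point."""
    def __init__(self, channels):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, channels, num_points); no neighborhood operations, purely pointwise
        return self.act(x + self.mlp(x))

points = torch.randn(2, 64, 1024)          # 2 clouds, 64-dim features, 1024 points
features = ResidualPointBlock(64)(points)  # same shape: (2, 64, 1024)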

MeshNet: Mesh Neural Network for 3D Shape Representation

2 code implementations 28 Nov 2018 Yutong Feng, Yifan Feng, Haoxuan You, Xibin Zhao, Yue Gao

However, there has been little effort to use mesh data in recent years, due to its complexity and irregularity.

3D Shape Classification 3D Shape Representation +2

PointDAN: A Multi-Scale 3D Domain Adaption Network for Point Cloud Representation

2 code implementations NeurIPS 2019 Can Qin, Haoxuan You, Lichen Wang, C.-C. Jay Kuo, Yun Fu

Specifically, most general-purpose DA methods, which strive for global feature alignment while ignoring local geometric information, are not suitable for 3D domain alignment.

Unsupervised Domain Adaptation

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

1 code implementation 26 Jul 2022 Haoxuan You, Luowei Zhou, Bin Xiao, Noel Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan

Large-scale multi-modal contrastive pre-training has demonstrated great utility to learn transferable features for a range of downstream tasks by mapping multiple modalities into a shared embedding space.
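
The shared embedding space mentioned in this snippet is typically trained with a symmetric image-text contrastive (InfoNCE) objective. Below is a generic CLIP-style sketch of that loss for context; it does not reflect the modality-shared architecture the paper itself proposes.

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))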

Graph-MLP: Node Classification without Message Passing in Graph

1 code implementation 8 Jun 2021 Yang Hu, Haoxuan You, Zhecan Wang, Zhicheng Wang, Erjin Zhou, Yue Gao

Graph Neural Networks (GNNs) have demonstrated their effectiveness in dealing with non-Euclidean structural data.

Classification Node Classification

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

1 code implementation 24 May 2023 Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer.
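
A schematic of the iterative decomposition loop described above, with placeholder callables standing in for the sub-question LLM, the VLM answerer, and the reasoning LLM; the function names, return formats, and stopping criterion are illustrative assumptions.

def idealgpt_style_loop(image, main_question, gen_llm, vlm, reason_llm, max_rounds=3):
    """Iteratively decompose a visual question: one LLM proposes sub-questions,
    a VLM answers them, and another LLM tries to reason out the final answer."""
    evidence = []                                               # (sub-question, sub-answer) pairs
    verdict = {"answer": None, "confident": False}
    for _ in range(max_rounds):
        sub_questions = gen_llm(main_question, evidence)        # list[str]
        sub_answers = [vlm(image, q) for q in sub_questions]    # list[str]
        evidence.extend(zip(sub_questions, sub_answers))
        verdict = reason_llm(main_question, evidence)           # {"answer": str, "confident": bool}
        if verdict["confident"]:
            break
    return verdict["answer"]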

PointHop: An Explainable Machine Learning Method for Point Cloud Classification

3 code implementations 30 Jul 2019 Min Zhang, Haoxuan You, Pranav Kadam, Shan Liu, C.-C. Jay Kuo

In the attribute-building stage, we address the problem of unordered point cloud data using a space partitioning procedure and develop a robust descriptor that characterizes the relationship between a point and its one-hop neighbors in a PointHop unit.

Attribute BIG-bench Machine Learning +3
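
A rough sketch of the kind of space-partitioning descriptor this snippet describes: each point's k nearest neighbors are split into the eight octants around it and averaged per octant. The neighborhood size and the averaging are assumptions, not the exact PointHop unit.

import numpy as np

def octant_descriptor(points, k=16):
    """For each point, average its k nearest neighbors per octant (8 x 3 = 24 dims)."""
    n = points.shape[0]
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    knn = np.argsort(dists, axis=1)[:, 1:k + 1]               # skip the point itself
    descriptors = np.zeros((n, 8, 3))
    for i in range(n):
        rel = points[knn[i]] - points[i]                       # neighbors in the local frame
        octants = (rel[:, 0] > 0) * 4 + (rel[:, 1] > 0) * 2 + (rel[:, 2] > 0)
        for o in range(8):
            members = rel[octants == o]
            if len(members):
                descriptors[i, o] = members.mean(axis=0)
    return descriptors.reshape(n, 24)

cloud = np.random.rand(128, 3)
feats = octant_descriptor(cloud)   # (128, 24)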

Learning Visual Commonsense for Robust Scene Graph Generation

2 code implementations ECCV 2020 Alireza Zareian, Zhecan Wang, Haoxuan You, Shih-Fu Chang

Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild.

Graph Generation Scene Graph Generation +1

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

1 code implementation 3 Jul 2023 Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang

However, we find that visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding.

Image-text matching Sentence +2

Restricting Greed in Training of Generative Adversarial Network

no code implementations 28 Nov 2017 Haoxuan You, Zhicheng Jiao, Haojun Xu, Jie Li, Ying Wang, Xinbo Gao

Generative adversarial networks (GANs) have attracted wide research interest in the field of deep learning.

Generative Adversarial Network

PVNet: A Joint Convolutional Network of Point Cloud and Multi-View for 3D Shape Recognition

no code implementations 23 Aug 2018 Haoxuan You, Yifan Feng, Rongrong Ji, Yue Gao

With the recent proliferation of deep learning, various deep models with different representations have achieved state-of-the-art performance.

3D Object Recognition 3D Shape Classification +3

PVRNet: Point-View Relation Neural Network for 3D Shape Recognition

no code implementations 2 Dec 2018 Haoxuan You, Yifan Feng, Xibin Zhao, Changqing Zou, Rongrong Ji, Yue Gao

More specifically, based on the relation score module, the point-single-view fusion feature is first extracted by fusing the point cloud feature with each single-view feature according to their point-single-view relation; the point-multi-view fusion feature is then extracted by fusing the point cloud feature with the features of varying numbers of views according to their point-multi-view relation.

3D Shape Classification 3D Shape Recognition +3
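
A toy rendering of the relation-score-driven fusion sketched above: a small MLP scores each view feature against the global point cloud feature, and the normalized scores weight a point-multi-view fusion. The dimensions and scoring network are assumptions rather than the paper's exact relation score module.

import torch
import torch.nn as nn

class RelationFusion(nn.Module):
    """Score each view against the point cloud feature, then fuse using the scores as weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, point_feat, view_feats):
        # point_feat: (B, D) global point cloud feature; view_feats: (B, V, D) per-view features
        b, v, d = view_feats.shape
        expanded = point_feat.unsqueeze(1).expand(b, v, d)
        rel = self.score(torch.cat([expanded, view_feats], dim=-1))   # (B, V, 1) relation scores
        weights = torch.softmax(rel, dim=1)
        fused_views = (weights * view_feats).sum(dim=1)               # point-multi-view fusion
        return torch.cat([point_feat, fused_views], dim=-1)           # (B, 2D) joint feature

fused = RelationFusion(256)(torch.randn(4, 256), torch.randn(4, 12, 256))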

Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering

no code implementations 13 Dec 2018 Gao Peng, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven Hoi, Xiaogang Wang, Hongsheng Li

It can robustly capture the high-level interactions between language and vision domains, thus significantly improving the performance of visual question answering.

Question Answering Visual Question Answering

Multi-modality Latent Interaction Network for Visual Question Answering

no code implementations ICCV 2019 Peng Gao, Haoxuan You, Zhanpeng Zhang, Xiaogang Wang, Hongsheng Li

The proposed module learns the cross-modality relationships between latent visual and language summarizations, which summarize visual regions and the question into a small number of latent representations to avoid modeling uninformative individual region-word relations.

Language Modelling Question Answering +1
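
A condensed sketch of the idea in this snippet: each modality is first summarized into a small set of latent vectors, and cross-modal attention then runs only between those latent summaries rather than over all region-word pairs. The number of latents and the attention layers are assumptions, not the paper's exact module.

import torch
import torch.nn as nn

class LatentInteraction(nn.Module):
    """Summarize each modality into a few latents, then let the latents cross-attend."""
    def __init__(self, dim, num_latents=8, heads=4):
        super().__init__()
        self.vis_latents = nn.Parameter(torch.randn(num_latents, dim))
        self.txt_latents = nn.Parameter(torch.randn(num_latents, dim))
        self.summarize = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.interact = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions, words):
        # regions: (B, R, D) visual region features; words: (B, W, D) question word features
        b = regions.size(0)
        vq = self.vis_latents.unsqueeze(0).expand(b, -1, -1)
        tq = self.txt_latents.unsqueeze(0).expand(b, -1, -1)
        vis_sum, _ = self.summarize(vq, regions, regions)    # latent visual summaries
        txt_sum, _ = self.summarize(tq, words, words)        # latent language summaries
        fused, _ = self.interact(vis_sum, txt_sum, txt_sum)  # cross-modality latent interaction
        return fused                                         # (B, num_latents, D)

out = LatentInteraction(128)(torch.randn(2, 36, 128), torch.randn(2, 14, 128))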

MA-CLIP: Towards Modality-Agnostic Contrastive Language-Image Pre-training

no code implementations 29 Sep 2021 Haoxuan You, Luowei Zhou, Bin Xiao, Noel C Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan

Large-scale multimodal contrastive pretraining has demonstrated great utility to support high performance in a range of downstream tasks by mapping multiple modalities into a shared embedding space.

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

no code implementations 16 Dec 2021 Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang

For pre-training, a scene-graph-aware pre-training method is proposed to leverage structural knowledge extracted from the visual scene graph.

Visual Commonsense Reasoning

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

no code implementations 15 Jan 2022 Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Jianwei Yang, Xiyang Dai, Bin Xiao, Haoxuan You, Shih-Fu Chang, Lu Yuan

Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only.

Question Answering Visual Commonsense Reasoning +2

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

no code implementations 22 Apr 2022 Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai, Bin Xiao, Jianwei Yang, Haoxuan You, Kai-Wei Chang, Shih-Fu Chang, Lu Yuan

Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.

Question Answering Visual Commonsense Reasoning +2

Bridging the Gap between Recognition-level Pre-training and Commonsensical Vision-language Tasks

no code implementations CSRR (ACL) 2022 Yue Wan, Yueen Ma, Haoxuan You, Zhecan Wang, Shih-Fu Chang

Large-scale visual-linguistic pre-training aims to capture generic representations from multimodal features, which are essential for downstream vision-language tasks.

Informativeness Type prediction +1

Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense

no code implementations 10 Nov 2022 Zhecan Wang, Haoxuan You, Yicheng He, Wenhao Li, Kai-Wei Chang, Shih-Fu Chang

Visual commonsense understanding requires vision-language (VL) models not only to understand image and text but also to cross-reference between them in order to fully integrate and comprehend the described visual scene.

Find Someone Who: Visual Commonsense Understanding in Human-Centric Grounding

no code implementations 14 Dec 2022 Haoxuan You, Rui Sun, Zhecan Wang, Kai-Wei Chang, Shih-Fu Chang

We present a new commonsense task, Human-centric Commonsense Grounding, that tests a model's ability to ground individuals given context descriptions of what happened before and of their mental/physical states or intentions.
