Search Results for author: Xin Eric Wang

Found 61 papers, 31 papers with code

Counterfactual Vision-and-Language Navigation via Adversarial Path Sampler

no code implementations ECCV 2020 Tsu-Jui Fu, Xin Eric Wang, Matthew F. Peterson, Scott T. Grafton, Miguel P. Eckstein, William Yang Wang

In particular, we present a model-agnostic adversarial path sampler (APS) that learns to sample challenging paths that force the navigator to improve based on the navigation performance.

counterfactual • Counterfactual Reasoning +2
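
A conceptual sketch of the adversarial sampling loop described above. The `Navigator` and `PathSampler` classes are hypothetical stand-ins, not the paper's code; the real APS operates on VLN environments and navigation policies.

```python
# Toy sketch of adversarial path sampling: the sampler is rewarded by
# the navigator's training loss, so it drifts toward paths the
# navigator currently fails on.
import random

class Navigator:
    def train_on(self, paths):
        # Stub: train on the sampled paths and return the training loss.
        return random.random()

class PathSampler:
    def __init__(self):
        self.difficulty = 0.5
    def sample_batch(self, n=8):
        # Stub: propose n candidate paths at the current difficulty.
        return [f"path-{i}" for i in range(n)]
    def update(self, reward):
        # A higher navigator loss rewards the sampler for harder paths.
        self.difficulty += 0.01 * reward

navigator, sampler = Navigator(), PathSampler()
for _ in range(100):
    loss = navigator.train_on(sampler.sample_batch())
    sampler.update(reward=loss)
```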

Agent S: An Open Agentic Framework that Uses Computers Like a Human

1 code implementation 10 Oct 2024 Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, Xin Eric Wang

We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks.

AI Agent • Task Planning
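
A minimal observe-plan-act loop in the spirit of such GUI agents; every helper below is a hypothetical stub, not Agent S's actual API.

```python
# Illustrative GUI-agent loop: observe the screen, ask a planner for the
# next action, execute it, and stop when the task is declared done.
from dataclasses import dataclass, field

@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)

def capture_screen() -> bytes:
    return b""  # stub: would grab a screenshot

def plan_next_action(task: str, screenshot: bytes, history: list) -> Action:
    return Action("done")  # stub: would query an LLM planner

def execute(action: Action) -> None:
    pass  # stub: would drive mouse and keyboard

def run_gui_agent(task: str, max_steps: int = 20) -> None:
    history: list = []
    for _ in range(max_steps):
        action = plan_next_action(task, capture_screen(), history)
        if action.name == "done":
            break
        execute(action)
        history.append(action)

run_gui_agent("open the settings panel")
```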

Multimodal Situational Safety

no code implementations 8 Oct 2024 Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Anderson Compalas, Dawn Song, Xin Eric Wang

To evaluate this capability, we develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.

Instruction Following

EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing

no code implementations 3 Oct 2024 Kaizhi Zheng, Xiaotong Chen, Xuehai He, Jing Gu, Linjie Li, Zhengyuan Yang, Kevin Lin, JianFeng Wang, Lijuan Wang, Xin Eric Wang

Given the steep learning curve of professional 3D software and the time-consuming process of managing large 3D assets, language-guided 3D scene editing has significant potential in fields such as virtual reality, augmented reality, and gaming.

3D scene Editing

NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

1 code implementation 17 Jul 2024 Gengze Zhou, Yicong Hong, Zun Wang, Xin Eric Wang, Qi Wu

Capitalizing on the remarkable advancements in Large Language Models (LLMs), there is a burgeoning initiative to harness LLMs for instruction-following robotic navigation.

Instruction Following • Vision and Language Navigation

Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

1 code implementation 27 Jun 2024 Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang

Based on the tree, our ToL agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements.
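A minimal sketch of a layout tree of this kind: each node pairs a content label with a bounding box, so describing a node yields both content and spatial structure. The fields and labels are illustrative assumptions, not the paper's data structure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Region:
    label: str                       # e.g., "toolbar", "text block"
    bbox: Tuple[int, int, int, int]  # (x, y, w, h) in screen pixels
    children: List["Region"] = field(default_factory=list)

    def describe(self, depth: int = 0) -> str:
        # Recursively render the region and its children, indented by depth.
        lines = ["  " * depth + f"{self.label} at {self.bbox}"]
        lines += [c.describe(depth + 1) for c in self.children]
        return "\n".join(lines)

screen = Region("screen", (0, 0, 1920, 1080),
                [Region("toolbar", (0, 0, 1920, 64),
                        [Region("search box", (700, 12, 520, 40))])])
print(screen.describe())
```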

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

1 code implementation 12 Jun 2024 Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, JianFeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang

Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics.

counterfactual • Future prediction +1

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

1 code implementation 30 May 2024 Qianqi Yan, Xuehai He, Xiang Yue, Xin Eric Wang

This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions.

Medical Diagnosis • Medical Visual Question Answering +3
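
The probing comparison reduces to checking accuracy against the random-guess baseline; a toy illustration with synthetic predictions (not the paper's data).

```python
# On k-way multiple choice, random guessing scores 1/k; a model is
# "worse than random" when its accuracy falls below that line.
predictions = ["B", "C", "A", "D", "B"]   # hypothetical model outputs
answers     = ["A", "C", "B", "D", "C"]
accuracy = sum(p == a for p, a in zip(predictions, answers)) / len(answers)
random_baseline = 1 / 4                    # four answer options
print(f"accuracy={accuracy:.2f} vs random={random_baseline:.2f}")
```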

FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

no code implementations 8 May 2024 Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit Bansal, Vicente Ordonez, Gunnar A Sigurdsson, Nanyun Peng, Xin Eric Wang

Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps.

Text-to-Image Generation

SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing

no code implementations 8 Apr 2024 Jing Gu, Nanxuan Zhao, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Yilin Wang, Xin Eric Wang

Compared with existing methods for personalized subject swapping, SwapAnything has three unique advantages: (1) precise control of arbitrary objects and parts rather than the main subject, (2) more faithful preservation of context pixels, (3) better adaptation of the personalized concept to the image.

Image Generation • Object

Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA

no code implementations 29 Jan 2024 Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Xinze Guan, Xin Eric Wang

Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Multimodal Large Language Models (MLLMs) tested, even though humans can attain approximately 99% accuracy on these questions.

Benchmarking • Image Comprehension +4

ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models

no code implementations 9 Oct 2023 Kaiwen Zhou, Kwonjoon Lee, Teruhisa Misu, Xin Eric Wang

For problems where the goal is to infer conclusions beyond image content, which we refer to as visual commonsense inference (VCI), VLMs face difficulties, while LLMs, given sufficient visual evidence, can use commonsense to infer the answer well.

Image Captioning • Visual Commonsense Reasoning
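
A sketch of that division of labor: a VLM turns the image into textual evidence and an LLM applies commonsense to it. Both model calls below are hypothetical stubs, not the paper's API.

```python
def vlm_describe(image_path: str) -> str:
    # Stub: a VLM would return a caption or object list as visual evidence.
    return "a man carries a closed umbrella under a clear sky"

def llm_complete(prompt: str) -> str:
    # Stub: an LLM would complete the prompt with a commonsense inference.
    return "He likely expects rain later in the day."

def visual_commonsense_inference(image_path: str, question: str) -> str:
    evidence = vlm_describe(image_path)
    prompt = (f"Visual evidence: {evidence}\n"
              f"Question: {question}\n"
              "Answer using commonsense:")
    return llm_complete(prompt)

print(visual_commonsense_inference("img.jpg", "Why carry an umbrella?"))
```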

LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models

1 code implementation 5 Oct 2023 Saaket Agashe, Yue Fan, Anthony Reyna, Xin Eric Wang

In this study, we introduce a new LLM-Coordination Benchmark aimed at a detailed analysis of LLMs within the context of Pure Coordination Games, where participating agents need to cooperate for the most gain.

Multiple-choice Question Answering

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

1 code implementation 3 Oct 2023 Kaizhi Zheng, Xuehai He, Xin Eric Wang

Multimodal Large Language Models (MLLMs) demonstrate profound capabilities in multimodal understanding.

Image Generation • multimodal generation +2

T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation

no code implementations 1 Jun 2023 Jialu Wang, Xinyue Gabby Liu, Zonglin Di, Yang Liu, Xin Eric Wang

In this work, we seek to measure more complex human biases that exist in the task of text-to-image generation.

Text-to-Image Generation

LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

1 code implementation NeurIPS 2023 Weixi Feng, Wanrong Zhu, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang

When combined with a downstream image generation model, LayoutGPT outperforms text-to-image models/systems by 20-40% and achieves comparable performance as human users in designing visual layouts for numerical and spatial correctness.

Indoor Scene Synthesis • Text-to-Image Generation
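
A hedged sketch of the two-stage idea: an LLM emits a structured layout that a layout-conditioned generator then renders. The JSON schema and parser below are a simplification for illustration, not the paper's exact interface (LayoutGPT itself describes a CSS-style format).

```python
# Stage 1: ask an LLM for object bounding boxes; Stage 2 would feed them
# to a layout-conditioned image generator. Prompt and schema are assumptions.
import json

LAYOUT_PROMPT = (
    "Return a JSON list for the scene below. Each entry: "
    '{"object": str, "bbox": [x, y, w, h]} on a 512x512 canvas.\n'
    "Scene: {caption}"
)

def parse_layout(llm_response: str):
    boxes = json.loads(llm_response)
    for b in boxes:  # check numerical/spatial sanity before rendering
        x, y, w, h = b["bbox"]
        assert 0 <= x and 0 <= y and x + w <= 512 and y + h <= 512
    return boxes

example = '[{"object": "apple", "bbox": [64, 300, 96, 96]}]'
print(parse_layout(example))
```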

R2H: Building Multimodal Navigation Helpers that Respond to Help Requests

no code implementations 23 May 2023 Yue Fan, Jing Gu, Kaizhi Zheng, Xin Eric Wang

Intelligent navigation-helper agents are critical as they can navigate users in unknown areas through environmental awareness and conversational ability, serving as potential accessibility tools for individuals with disabilities.

Benchmarking • Language Modeling +4

Collaborative Generative AI: Integrating GPT-k for Efficient Editing in Text-to-Image Generation

no code implementations 18 May 2023 Wanrong Zhu, Xinyi Wang, Yujie Lu, Tsu-Jui Fu, Xin Eric Wang, Miguel Eckstein, William Yang Wang

We conduct a series of experiments to compare the common edits made by humans and GPT-k, evaluate the performance of GPT-k in prompting T2I, and examine factors that may influence this process.

Text Generation • Text-to-Image Generation

LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation

1 code implementation NeurIPS 2023 Yujie Lu, Xianjun Yang, Xiujun Li, Xin Eric Wang, William Yang Wang

Existing automatic evaluation of text-to-image synthesis can only provide an image-text matching score, without considering object-level compositionality, which results in poor correlation with human judgments.

Attribute • Image Generation +2

Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment

1 code implementation 2 May 2023 Zhen Zhang, Jialu Wang, Xin Eric Wang

Extensive experiments on XTD and Multi30K datasets, covering 11 languages under zero-shot, few-shot, and full-dataset learning scenarios, show that our framework significantly reduces the multilingual disparities among languages and improves cross-lingual transfer results, especially in low-resource scenarios, while only keeping and fine-tuning an extremely small number of parameters compared to the full model (e.g., our framework only requires 0.16% additional parameters of a full model for each language in the few-shot learning scenario).

Cross-Lingual Transfer • Few-Shot Learning +2
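
A sketch of what "an extremely small number of parameters per language" can look like in practice: tiny residual adapters per language over a frozen shared backbone. The bottleneck design and sizes are assumptions, not the paper's exact architecture.

```python
# Each language gets ~dim*bottleneck*2 trainable parameters; the shared
# vision-language backbone stays frozen.
import torch.nn as nn

class LanguageAdapter(nn.Module):
    def __init__(self, dim: int = 512, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual update

adapters = nn.ModuleDict({lang: LanguageAdapter() for lang in ["de", "fr", "cs"]})
print(sum(p.numel() for p in adapters["de"].parameters()))  # per-language cost
```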

Multimodal Procedural Planning via Dual Text-Image Prompting

1 code implementation 2 May 2023 Yujie Lu, Pan Lu, Zhiyu Chen, Wanrong Zhu, Xin Eric Wang, William Yang Wang

The key challenges of MPP are to ensure the informativeness, temporal coherence, and accuracy of plans across modalities.

Image to text • Informativeness +1

Multimodal Graph Transformer for Multimodal Question Answering

no code implementations 30 Apr 2023 Xuehai He, Xin Eric Wang

Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly.

Question Answering

ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation

no code implementations 30 Jan 2023 Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, Xin Eric Wang

Such object navigation tasks usually require large-scale training in visual environments with labeled objects, which generalizes poorly to novel objects in unknown environments.

Efficient Exploration • Language Modeling +3

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

1 code implementation 9 Dec 2022 Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, William Yang Wang

In this work, we improve the compositional skills of T2I models, specifically more accurate attribute binding and better image compositions.

Attribute • Image Generation

Navigation as Attackers Wish? Towards Building Robust Embodied Agents under Federated Learning

no code implementations 27 Nov 2022 Yunchao Zhang, Zonglin Di, Kaiwen Zhou, Cihang Xie, Xin Eric Wang

However, since the local data is inaccessible to the server under federated learning, attackers may easily poison the training data of the local client to build a backdoor in the agent without notice.

Federated Learning • Navigate +1

ComCLIP: Training-Free Compositional Image and Text Matching

1 code implementation 25 Nov 2022 Kenan Jiang, Xuehai He, Ruize Xu, Xin Eric Wang

Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text.

Image-text matching • Image-text Retrieval +2
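
For reference, the zero-shot CLIP matching that ComCLIP builds on can be reproduced in a few lines with Hugging Face's CLIP port; the checkpoint name and example inputs here are illustrative.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local image
texts = ["a dog chasing a ball", "a ball chasing a dog"]  # compositional pair

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text match scores
print(dict(zip(texts, logits.softmax(dim=-1)[0].tolist())))
```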

CPL: Counterfactual Prompt Learning for Vision and Language Models

no code implementations 19 Oct 2022 Xuehai He, Diji Yang, Weixi Feng, Tsu-Jui Fu, Arjun Akula, Varun Jampani, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang

Prompt tuning is a new few-shot transfer learning technique that only tunes the learnable prompt for pre-trained vision and language models such as CLIP.

counterfactual • Image-text Retrieval +1
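
A minimal sketch of the underlying prompt-tuning setup (in the CoOp style): only a handful of continuous "soft prompt" vectors are trained while the CLIP encoders stay frozen. Shapes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx: int = 16, embed_dim: int = 512):
        super().__init__()
        # n_ctx soft tokens shared across classes, prepended to each
        # class-name embedding before the frozen text encoder.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

    def forward(self, class_embeddings: torch.Tensor) -> torch.Tensor:
        # class_embeddings: (num_classes, n_tokens, embed_dim)
        ctx = self.ctx.unsqueeze(0).expand(class_embeddings.size(0), -1, -1)
        return torch.cat([ctx, class_embeddings], dim=1)

prompt = LearnablePrompt()
optimizer = torch.optim.SGD(prompt.parameters(), lr=2e-3)  # ~8k trainable params
out = prompt(torch.randn(10, 4, 512))
print(out.shape)  # torch.Size([10, 20, 512])
```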

Anticipating the Unseen Discrepancy for Vision and Language Navigation

no code implementations 10 Sep 2022 Yujie Lu, Huiliang Zhang, Ping Nie, Weixi Feng, Wenda Xu, Xin Eric Wang, William Yang Wang

In this paper, we propose an Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS) that learns to generalize to unseen environments via encouraging test-time visual consistency.

Data Augmentation • Decision Making +3

JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents

no code implementations 28 Aug 2022 Kaizhi Zheng, Kaiwen Zhou, Jing Gu, Yue Fan, Jialu Wang, Zonglin Di, Xuehai He, Xin Eric Wang

Building a conversational embodied agent to execute real-life tasks has been a long-standing yet quite challenging research goal, as it requires effective human-agent communication, multi-modal understanding, long-range sequential decision making, etc.

Action Generation • Common Sense Reasoning +2

Understanding Instance-Level Impact of Fairness Constraints

1 code implementation 30 Jun 2022 Jialu Wang, Xin Eric Wang, Yang Liu

A variety of fairness constraints have been proposed in the literature to mitigate group-level statistical bias.

Fairness

VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation

no code implementations 17 Jun 2022 Kaizhi Zheng, Xiaotong Chen, Odest Chadwicke Jenkins, Xin Eric Wang

We hope the new simulator and benchmark will facilitate future research on language-guided robotic manipulation.

Object

Neuro-Symbolic Procedural Planning with Commonsense Prompting

no code implementations 6 Jun 2022 Yujie Lu, Weixi Feng, Wanrong Zhu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, William Yang Wang

Procedural planning aims to implement complex high-level goals by decomposition into sequential simpler low-level steps.

Graph Sampling

Aerial Vision-and-Dialog Navigation

2 code implementations 24 May 2022 Yue Fan, Winson Chen, Tongzhou Jiang, Chun Zhou, Yi Zhang, Xin Eric Wang

To this end, we introduce Aerial Vision-and-Dialog Navigation (AVDN) to navigate a drone via natural language conversation.

Navigate

Imagination-Augmented Natural Language Understanding

1 code implementation NAACL 2022 Yujie Lu, Wanrong Zhu, Xin Eric Wang, Miguel Eckstein, William Yang Wang

Human brains integrate linguistic and perceptual information simultaneously to understand natural language, and hold the critical ability to render imaginations.

Natural Language Understanding

Parameter-efficient Model Adaptation for Vision Transformers

3 code implementations 29 Mar 2022 Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, Xin Eric Wang

In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classification task.

Benchmarking • Classification +2
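
One of the strategies such studies compare is a low-rank update on frozen linear layers; a generic sketch follows, with dimensions and hyperparameters that are illustrative rather than the paper's chosen configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(768, 768))      # e.g., a ViT attention projection
print(layer(torch.randn(2, 768)).shape)
```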

FedVLN: Privacy-preserving Federated Vision-and-Language Navigation

1 code implementation 28 Mar 2022 Kaiwen Zhou, Xin Eric Wang

Data privacy is a central problem for embodied agents that can perceive the environment, communicate with humans, and act in the real world.

Privacy Preserving • Vision and Language Navigation
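
A generic FedAvg-style aggregation sketch (not FedVLN's exact scheme): raw trajectories stay on-device and only model weights reach the server.

```python
from typing import Dict, List
import torch

def fedavg(client_states: List[Dict[str, torch.Tensor]],
           client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Average client model weights, weighted by local dataset size."""
    total = sum(client_sizes)
    return {
        key: sum(state[key] * (n / total)
                 for state, n in zip(client_states, client_sizes))
        for key in client_states[0]
    }

# Toy check with two one-parameter "clients".
a = {"w": torch.tensor([1.0])}
b = {"w": torch.tensor([3.0])}
print(fedavg([a, b], [100, 300]))  # {'w': tensor([2.5000])}
```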

Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning

1 code implementation CVPR 2022 Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu, Yi Yang, Yueting Zhuang, Xin Eric Wang

To systematically measure the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG.

Diversity • Semantic correspondence +1

Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

1 code implementation ACL 2022 Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, Xin Eric Wang

A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks.

Vision and Language Navigation

Relational Graph Learning for Grounded Video Description Generation

no code implementations 2 Dec 2021 Wenqiao Zhang, Xin Eric Wang, Siliang Tang, Haizhou Shi, Haocheng Shi, Jun Xiao, Yueting Zhuang, William Yang Wang

Such a setting can help explain the decisions of captioning models and prevents the model from hallucinating object words in its description.

Graph Learning • Hallucination +3

Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search

2 code implementations EMNLP 2021 Jialu Wang, Yang Liu, Xin Eric Wang

Internet search affects people's cognition of the world, so mitigating biases in search results and learning fair models is imperative for social good.

Image Retrieval • Natural Language Queries

CUDA-GHR: Controllable Unsupervised Domain Adaptation for Gaze and Head Redirection

1 code implementation 21 Jun 2021 Swati Jindal, Xin Eric Wang

However, adapting such generative models to new domains while maintaining their ability to provide fine-grained control over different image attributes, e.g., gaze and head pose directions, has been a challenging problem.

Benchmarking • gaze redirection +3

Assessing Multilingual Fairness in Pre-trained Multimodal Representations

no code implementations Findings (ACL) 2022 Jialu Wang, Yang Liu, Xin Eric Wang

To answer these questions, we view language as the fairness recipient and introduce two new fairness notions, multilingual individual fairness and multilingual group fairness, for pre-trained multimodal models.

Fairness

ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation

no code implementations 10 Jun 2021 Wanrong Zhu, Xin Eric Wang, An Yan, Miguel Eckstein, William Yang Wang

Automatic evaluations for natural language generation (NLG) conventionally rely on token-level or embedding-level comparisons with text references.

nlg evaluation • Text Generation
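
The embedding-level comparison mentioned above can be as simple as cosine similarity between sentence embeddings; a sketch with sentence-transformers (the model name is an assumption, any sentence encoder works).

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
reference = "a man rides a horse on the beach"
hypothesis = "someone is riding a horse along the shore"
score = util.cos_sim(model.encode(reference), model.encode(hypothesis))
print(float(score))  # higher = closer to the reference in embedding space
```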

Language-Driven Image Style Transfer

1 code implementation 1 Jun 2021 Tsu-Jui Fu, Xin Eric Wang, William Yang Wang

We propose contrastive language visual artist (CLVA) that learns to extract visual semantics from style instructions and accomplish LDAST by the patch-wise style discriminator.

Style Transfer

M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers

no code implementations CVPR 2022 Tsu-Jui Fu, Xin Eric Wang, Scott T. Grafton, Miguel P. Eckstein, William Yang Wang

LBVE contains two features: 1) the scenario of the source video is preserved instead of generating a completely different video; 2) the semantics are presented differently in the target video, and all changes are controlled by the given instruction.

Video Editing • Video Understanding

L2C: Describing Visual Differences Needs Semantic Understanding of Individuals

no code implementations EACL 2021 An Yan, Xin Eric Wang, Tsu-Jui Fu, William Yang Wang

Recent advances in language and vision push forward the research of captioning a single image to describing visual differences between image pairs.

Image Captioning

Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations

no code implementations EMNLP 2020 Wanrong Zhu, Xin Eric Wang, Pradyumna Narayana, Kazoo Sone, Sugato Basu, William Yang Wang

A major challenge in visually grounded language generation is to build robust benchmark datasets and models that can generalize well in real-world settings.

Text Generation

SSCR: Iterative Language-Based Image Editing via Self-Supervised Counterfactual Reasoning

1 code implementation EMNLP 2020 Tsu-Jui Fu, Xin Eric Wang, Scott Grafton, Miguel Eckstein, William Yang Wang

In this paper, we introduce a Self-Supervised Counterfactual Reasoning (SSCR) framework that incorporates counterfactual thinking to overcome data scarcity.

counterfactual • Counterfactual Reasoning

Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation

1 code implementation EACL 2021 Wanrong Zhu, Xin Eric Wang, Tsu-Jui Fu, An Yan, Pradyumna Narayana, Kazoo Sone, Sugato Basu, William Yang Wang

Outdoor vision-and-language navigation (VLN) is such a task where an agent follows natural language instructions and navigates a real-life urban environment.

Ranked #5 on Vision and Language Navigation on Touchdown Dataset (using extra training data)

Style Transfer • Text Style Transfer +1

Environment-agnostic Multitask Learning for Natural Language Grounded Navigation

1 code implementation ECCV 2020 Xin Eric Wang, Vihan Jain, Eugene Ie, William Yang Wang, Zornitsa Kozareva, Sujith Ravi

Recent research efforts enable study for natural language grounded navigation in photo-realistic environments, e.g., following natural language instructions or dialog.

Vision-Language Navigation

Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling

no code implementations 17 Nov 2019 Tsu-Jui Fu, Xin Eric Wang, Matthew Peterson, Scott Grafton, Miguel Eckstein, William Yang Wang

In particular, we present a model-agnostic adversarial path sampler (APS) that learns to sample challenging paths that force the navigator to improve based on the navigation performance.

counterfactual • Counterfactual Reasoning +2

Cross-Lingual Vision-Language Navigation

2 code implementations 24 Oct 2019 An Yan, Xin Eric Wang, Jiangtao Feng, Lei Li, William Yang Wang

Commanding a robot to navigate with natural language instructions is a long-term goal for grounded language understanding and robotics.

Domain Adaptation • Navigate +2
