Search Results for author: Zhengyuan Yang

Found 66 papers, 36 papers with code

LiVOS: Light Video Object Segmentation with Gated Linear Matching

1 code implementation5 Nov 2024 Qin Liu, JianFeng Wang, Zhengyuan Yang, Linjie Li, Kevin Lin, Marc Niethammer, Lijuan Wang

Semi-supervised video object segmentation (VOS) has been largely driven by space-time memory (STM) networks, which store past frame features in a spatiotemporal memory to segment the current frame via softmax attention.

Semantic Segmentation Semi-Supervised Video Object Segmentation +1
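
The softmax-attention memory readout that the abstract above attributes to STM-style methods can be sketched roughly as follows. This is a minimal illustrative PyTorch sketch of the generic STM readout, not LiVOS's gated linear matching; the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def stm_readout(mem_keys, mem_values, query_keys):
    """Space-time memory readout via softmax attention (illustrative).

    mem_keys:   (B, Ck, T*H*W)  keys of past frames stored in memory
    mem_values: (B, Cv, T*H*W)  values of past frames stored in memory
    query_keys: (B, Ck, H*W)    keys of the current frame
    Returns:    (B, Cv, H*W)    memory features aggregated for each query pixel
    """
    # Affinity between every query pixel and every memory pixel.
    affinity = torch.einsum("bck,bcq->bkq", mem_keys, query_keys)  # (B, T*H*W, H*W)
    affinity = affinity / mem_keys.shape[1] ** 0.5
    weights = F.softmax(affinity, dim=1)                           # normalize over memory locations
    return torch.bmm(mem_values, weights)                          # (B, Cv, H*W)

# Toy shapes: two past frames of 16x16 feature maps.
B, Ck, Cv, T, H, W = 1, 64, 128, 2, 16, 16
out = stm_readout(torch.randn(B, Ck, T * H * W),
                  torch.randn(B, Cv, T * H * W),
                  torch.randn(B, Ck, H * W))
print(out.shape)  # torch.Size([1, 128, 256])
```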

GenXD: Generating Any 3D and 4D Scenes

no code implementations4 Nov 2024 Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, JianFeng Wang, Gim Hee Lee, Lijuan Wang

Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos.

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

no code implementations30 Oct 2024 Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, JianFeng Wang, Zhengyuan Yang, YingNian Wu, Lijuan Wang

Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module.

Video Generation
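
The temporal LoRA module mentioned in the abstract above follows the standard low-rank adaptation recipe: a frozen base weight plus a trainable low-rank update. Below is a minimal sketch of a LoRA-wrapped linear layer; the rank, scaling, and naming are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # slow (pretrained) weights stay frozen
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T

layer = LoRALinear(nn.Linear(320, 320))
print(layer(torch.randn(4, 320)).shape)  # torch.Size([4, 320])
```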

Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization

no code implementations4 Oct 2024 Zichen Miao, Zhengyuan Yang, Kevin Lin, Ze Wang, Zicheng Liu, Lijuan Wang, Qiang Qiu

We show that PSO can directly adapt distilled models to human-preferred generation with both offline and online-generated pairwise preference image data.

Image Generation Style Transfer
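
Tuning on pairwise preference data of the kind described above is typically driven by a logistic (Bradley-Terry style) loss that pushes the preferred sample's score above the rejected one's. The sketch below is a generic version of such an objective, not the paper's exact PSO formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred, score_rejected, beta=1.0):
    """Generic Bradley-Terry style objective over (preferred, rejected) score pairs."""
    return -F.logsigmoid(beta * (score_preferred - score_rejected)).mean()

# Toy scores for a batch of 8 human-preference pairs.
preferred = torch.randn(8, requires_grad=True)
rejected = torch.randn(8, requires_grad=True)
loss = pairwise_preference_loss(preferred, rejected)
loss.backward()
print(float(loss))
```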

EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing

no code implementations3 Oct 2024 Kaizhi Zheng, Xiaotong Chen, Xuehai He, Jing Gu, Linjie Li, Zhengyuan Yang, Kevin Lin, JianFeng Wang, Lijuan Wang, Xin Eric Wang

Given the steep learning curve of professional 3D software and the time-consuming process of managing large 3D assets, language-guided 3D scene editing has significant potential in fields such as virtual reality, augmented reality, and gaming.

3D scene Editing

AutoDirector: Online Auto-scheduling Agents for Multi-sensory Composition

no code implementations21 Aug 2024 Minheng Ni, Chenfei Wu, Huaying Yuan, Zhengyuan Yang, Ming Gong, Lijuan Wang, Zicheng Liu, WangMeng Zuo, Nan Duan

With the advancement of generative models, the synthesis of different sensory elements such as music, visuals, and speech has achieved significant realism.

Scheduling

IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation

1 code implementation15 Jul 2024 Yuanhao Zhai, Kevin Lin, Linjie Li, Chung-Ching Lin, JianFeng Wang, Zhengyuan Yang, David Doermann, Junsong Yuan, Zicheng Liu, Lijuan Wang

First, to enable dual-modal generation and maximize the information exchange between video and depth generation, we propose a unified dual-modal U-Net, a parameter-sharing framework for joint video and depth denoising, wherein a modality label guides the denoising target, and cross-modal attention enables the mutual information flow.

Denoising Monocular Depth Estimation +2
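
The cross-modal attention described in the abstract above lets one modality's tokens attend to the other's so that video and depth streams exchange information. A minimal sketch, with illustrative tensor shapes and module choices that are assumptions rather than the paper's architecture, is:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Let one modality's tokens attend to the other's (video <-> depth), illustrative only."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens_a, tokens_b):
        # Queries from modality A, keys/values from modality B.
        fused, _ = self.attn(tokens_a, tokens_b, tokens_b)
        return tokens_a + fused  # residual connection

video_tokens = torch.randn(2, 196, 256)   # (batch, tokens, dim)
depth_tokens = torch.randn(2, 196, 256)
xattn = CrossModalAttention()
video_out = xattn(video_tokens, depth_tokens)  # video attends to depth
depth_out = xattn(depth_tokens, video_tokens)  # depth attends to video
print(video_out.shape, depth_out.shape)
```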

VideoGUI: A Benchmark for GUI Automation from Instructional Videos

no code implementations14 Jun 2024 Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks.

Video Editing

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

1 code implementation12 Jun 2024 Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, JianFeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang

Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics.

counterfactual Future prediction +1

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

1 code implementation25 Apr 2024 An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, JianFeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang

Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V, by enabling the model to associate visual objects with tags inserted on the image.

Visual Grounding Visual Question Answering +1

StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis

no code implementations30 Jan 2024 Zecheng Tang, Chenfei Wu, Zekai Zhang, Mingheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, Nan Duan

To leverage LLMs for visual synthesis, traditional methods convert raster image information into discrete grid tokens through specialized visual modules, which disrupts the model's ability to capture the true semantic representation of visual scenes.

Vector Graphics

Bring Metric Functions into Diffusion Models

no code implementations4 Jan 2024 Jie An, Zhengyuan Yang, JianFeng Wang, Linjie Li, Zicheng Liu, Lijuan Wang, Jiebo Luo

The first module, similar to a standard DDPM, learns to predict the added noise and is unaffected by the metric function.

Denoising
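
The noise-prediction module described above corresponds to the standard DDPM training objective: add noise to a clean sample at a random timestep and train the network to predict that noise. A minimal sketch with a toy stand-in denoiser (the U-Net and schedule details are placeholder assumptions) is:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(model, x0):
    """Standard DDPM objective: predict the noise added at a random timestep."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward diffusion
    return F.mse_loss(model(x_t, t), noise)

denoiser = nn.Conv2d(3, 3, 3, padding=1)                    # toy stand-in for a U-Net
loss = ddpm_loss(lambda x, t: denoiser(x), torch.randn(4, 3, 32, 32))
print(float(loss))
```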

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

no code implementations1 Jan 2024 Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, JianFeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

COSMO, our unified framework, merges unimodal and multimodal elements, enhancing model performance for tasks involving textual and visual data while notably reducing learnable parameters.

Language Modelling Reading Comprehension +1

Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning

no code implementations CVPR 2024 Zichen Miao, Jiang Wang, Ze Wang, Zhengyuan Yang, Lijuan Wang, Qiang Qiu, Zicheng Liu

We also show the effectiveness of our RL fine-tuning framework on enhancing the diversity of image generation with different types of diffusion models, including class-conditional models and text-conditional models, e.g., Stable Diffusion.

Decision Making Diversity +4

InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models

no code implementations21 Dec 2023 Bingbing Wen, Zhengyuan Yang, JianFeng Wang, Zhe Gan, Bill Howe, Lijuan Wang

In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round even with external knowledge related to the visual content.

Interfacing Foundation Models' Embeddings

1 code implementation12 Dec 2023 Xueyan Zou, Linjie Li, JianFeng Wang, Jianwei Yang, Mingyu Ding, Junyi Wei, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang

To further unleash the power of foundation models, we present FIND, a generalized interface for aligning foundation models' embeddings with unified image and dataset-level understanding spanning modality and granularity.

Decoder Image Segmentation +3

GPT-4V(ision) as A Social Media Analysis Engine

1 code implementation13 Nov 2023 Hanjia Lyu, Jinfa Huang, Daoan Zhang, Yongsheng Yu, Xinyi Mou, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, Jiebo Luo

Our investigation begins with a preliminary quantitative analysis for each task using existing benchmark datasets, followed by a careful review of the results and a selection of qualitative samples that illustrate GPT-4V's potential in understanding multimodal social media content.

Hallucination Hate Speech Detection +1

MM-VID: Advancing Video Understanding with GPT-4V(ision)

1 code implementation30 Oct 2023 Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, JianFeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, Lijuan Wang

We present MM-VID, an integrated system that harnesses the capabilities of GPT-4V, combined with specialized tools in vision, audio, and speech, to facilitate advanced video understanding.

Script Generation Video Understanding

DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design

1 code implementation23 Oct 2023 Kevin Lin, Zhengyuan Yang, Linjie Li, JianFeng Wang, Lijuan Wang

For DEsignBench benchmarking, we perform human evaluations on generated images in DEsignBench gallery, against the criteria of image-text alignment, visual aesthetic, and design creativity.

Benchmarking Image Generation

Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation

no code implementations12 Oct 2023 Zhengyuan Yang, JianFeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang

We introduce "Idea to Image," a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation.


OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation

no code implementations11 Oct 2023 Jie An, Zhengyuan Yang, Linjie Li, JianFeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, Jiebo Luo

We hope our proposed framework, benchmark, and LMM evaluation could help establish the intriguing interleaved image-text generation task.

Question Answering Text Generation

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

1 code implementation29 Sep 2023 Zhengyuan Yang, Linjie Li, Kevin Lin, JianFeng Wang, Chung-Ching Lin, Zicheng Liu, Lijuan Wang

We hope that this preliminary exploration will inspire future research on the next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and a better understanding of multimodal foundation models.

Ranked #3 on MMR total on MRR-Benchmark (using extra training data)

MMR total

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

1 code implementation18 Sep 2023 Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao

This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants.

Survey Text-to-Image Generation

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

1 code implementation4 Aug 2023 Weihao Yu, Zhengyuan Yang, Linjie Li, JianFeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang

Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking.

Math Zero-Shot Visual Question Answering

Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models

no code implementations27 Jul 2023 Xin Yuan, Linjie Li, JianFeng Wang, Zhengyuan Yang, Kevin Lin, Zicheng Liu, Lijuan Wang

In this paper, we study the denoising diffusion probabilistic model (DDPM) in wavelet space, instead of pixel space, for visual synthesis.

Denoising
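
Running diffusion in wavelet space, as described above, means transforming each image into wavelet sub-bands and letting the diffusion model operate on those coefficients instead of raw pixels. The sketch below uses a single-level Haar transform purely for illustration; the paper's actual transform and band handling may differ.

```python
import torch

def haar_dwt(x):
    """Single-level 2D Haar wavelet transform (illustrative).

    x: (B, C, H, W) with even H, W. Returns (B, 4*C, H/2, W/2), stacking the
    low-low, low-high, high-low, and high-high sub-bands along channels.
    """
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return torch.cat([ll, lh, hl, hh], dim=1)

img = torch.randn(1, 3, 64, 64)
coeffs = haar_dwt(img)   # diffusion would then operate on these coefficients
print(coeffs.shape)      # torch.Size([1, 12, 32, 32])
```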

DisCo: Disentangled Control for Realistic Human Dance Generation

1 code implementation CVPR 2024 Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang

In this paper, we depart from the traditional paradigm of human motion transfer and emphasize two additional critical attributes for the synthesis of human dance content in social media contexts: (i) Generalizability: the model should be able to generalize beyond generic human viewpoints as well as unseen human subjects, backgrounds, and poses; (ii) Compositionality: it should allow for the seamless composition of seen/unseen subjects, backgrounds, and poses from different sources.

Attribute

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

2 code implementations13 Apr 2023 Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal

In this paper, we propose LayoutBench, a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape.

Layout-to-Image Generation

Equivariant Similarity for Vision-Language Foundation Models

1 code implementation ICCV 2023 Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang

Unlike the existing image-text similarity objective which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires similarity to vary faithfully according to the semantic changes.

Image-text Retrieval Text Retrieval +2

SGFormer: Semantic Graph Transformer for Point Cloud-based 3D Scene Graph Generation

1 code implementation20 Mar 2023 Changsheng Lv, Mengshi Qi, Xia Li, Zhengyuan Yang, Huadong Ma

In this paper, we propose a novel model called SGFormer, Semantic Graph TransFormer for point cloud-based 3D scene graph generation.

3d scene graph generation Graph Embedding +3

PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3

no code implementations ICCV 2023 Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo

PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA).

Image Captioning Question Answering +3

GRiT: A Generative Region-to-text Transformer for Object Understanding

1 code implementation1 Dec 2022 Jialian Wu, JianFeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang

Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions.

Decoder Dense Captioning +4
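
The three-stage design described above (visual encoder, foreground object extractor, text decoder) can be sketched as a skeleton like the one below. The module choices, dimensions, and vocabulary size are placeholder assumptions, not GRiT's actual components.

```python
import torch
import torch.nn as nn

class GRiTStyleModel(nn.Module):
    """Skeleton of an encoder -> region extractor -> text decoder pipeline (illustrative)."""

    def __init__(self, dim=256, vocab=30522):
        super().__init__()
        self.visual_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # placeholder backbone
        self.region_head = nn.Linear(dim, 4)                                # box regression per token
        decoder_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.token_embed = nn.Embedding(vocab, dim)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, image, text_ids):
        feats = self.visual_encoder(image).flatten(2).transpose(1, 2)  # (B, N, dim)
        boxes = self.region_head(feats)                                # one box per visual token
        tokens = self.token_embed(text_ids)
        decoded = self.text_decoder(tokens, feats)                     # text attends to region features
        return boxes, self.lm_head(decoded)

model = GRiTStyleModel()
boxes, logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 30522, (1, 12)))
print(boxes.shape, logits.shape)
```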

PromptCap: Prompt-Guided Task-Aware Image Captioning

1 code implementation15 Nov 2022 Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, Jiebo Luo

PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA).

Image Captioning Language Modelling +5

Prompting GPT-3 To Be Reliable

1 code implementation17 Oct 2022 Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, JianFeng Wang, Jordan Boyd-Graber, Lijuan Wang

While reliability is a broad and vaguely defined term, we decompose reliability into four main facets that correspond to the existing framework of ML safety and are well-recognized to be important: generalizability, social biases, calibration, and factuality.

Fairness Language Modelling

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

1 code implementation14 Jun 2022 Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang

For another, we devise Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at the intermediate layers.

Visual Grounding

GIT: A Generative Image-to-text Transformer for Vision and Language

1 code implementation27 May 2022 JianFeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.

Decoder Image Captioning +9

Cross-modal Contrastive Distillation for Instructional Activity Anticipation

no code implementations18 Jan 2022 Zhengyuan Yang, Jingen Liu, Jing Huang, Xiaodong He, Tao Mei, Chenliang Xu, Jiebo Luo

In this study, we aim to predict the plausible future action steps given an observation of the past and study the task of instructional activity anticipation.

Knowledge Distillation

Scaling Up Vision-Language Pre-training for Image Captioning

no code implementations CVPR 2022 Xiaowei Hu, Zhe Gan, JianFeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang

In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning.

Ranked #3 on Image Captioning on nocaps-XD entire (using extra training data)

Attribute Image Captioning

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling

1 code implementation23 Nov 2021 Zhengyuan Yang, Zhe Gan, JianFeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang

On grounded captioning, UniTAB presents a simpler solution with a single output head, and significantly outperforms state of the art in both grounding and captioning evaluations.

Image Captioning Language Modelling +5

UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

no code implementations19 Nov 2021 JianFeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, Lijuan Wang

In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning.

Image Captioning Image-text matching +9

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

1 code implementation10 Sep 2021 Zhengyuan Yang, Zhe Gan, JianFeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang

To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA.

Ranked #20 on Visual Question Answering (VQA) on OK-VQA (using extra training data)

Image Captioning Question Answering +2
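
Prompting a text-only LLM with image captions, as described above, amounts to assembling a few-shot text prompt in which each image is represented by its caption. The sketch below shows one plausible way such a prompt could be built; the template wording is an assumption, not the paper's exact prompt.

```python
def build_pica_style_prompt(examples, caption, question):
    """Assemble a few-shot text prompt that represents the image by its caption (illustrative)."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {e['caption']}\nQ: {e['question']}\nA: {e['answer']}\n\n" for e in examples
    )
    return header + shots + f"Context: {caption}\nQ: {question}\nA:"

few_shot = [
    {"caption": "A red double-decker bus on a city street.",
     "question": "What city is this likely in?", "answer": "london"},
]
prompt = build_pica_style_prompt(few_shot,
                                 "A man holding an umbrella in heavy rain.",
                                 "What season might it be?")
print(prompt)
```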

SAT: 2D Semantics Assisted Training for 3D Visual Grounding

1 code implementation ICCV 2021 Zhengyuan Yang, Songyang Zhang, LiWei Wang, Jiebo Luo

3D visual grounding aims at grounding a natural language description about a 3D scene, usually represented in the form of 3D point clouds, to the targeted object region.

3D visual grounding Object +1

TransVG: End-to-End Visual Grounding with Transformers

2 code implementations ICCV 2021 Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region onto an image.

Referring Expression Comprehension Visual Grounding

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

1 code implementation CVPR 2021 Zhengyuan Yang, Yijuan Lu, JianFeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo

Due to this aligned representation learning, even pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4%, compared with a non-TAP baseline.

Caption Generation Language Modelling +5

Pose-based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

no code implementations30 Oct 2020 Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo

We then evaluate the framework on a proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms.

Action Recognition Emotion Recognition

Dynamic Context-guided Capsule Network for Multimodal Machine Translation

1 code implementation4 Sep 2020 Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie zhou, Jiebo Luo

Particularly, we represent the input image with global and regional visual features, and introduce two parallel DCCNs to model multimodal context vectors with visual features at different granularities.

Decoder Multimodal Machine Translation +2

Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation

1 code implementation CVPR 2021 Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, Dong Yu

Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed.

Contrastive Learning Knowledge Distillation +6

Grounding-Tracking-Integration

no code implementations13 Dec 2019 Zhengyuan Yang, Tushar Kumar, Tianlang Chen, Jinsong Su, Jiebo Luo

In this paper, we study Tracking by Language that localizes the target box sequence in a video based on a language query.

Weakly Supervised Body Part Segmentation with Pose based Part Priors

no code implementations30 Jul 2019 Zhengyuan Yang, Yuncheng Li, Linjie Yang, Ning Zhang, Jiebo Luo

The core idea is first converting the sparse weak labels, such as keypoints, to an initial estimate of body part masks, and then iteratively refining the part mask predictions.

Face Parsing Segmentation +1
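
The keypoint-to-mask conversion described above can be pictured as painting a coarse region (e.g., a disk) around each keypoint to serve as the initial part estimate before refinement. The sketch below illustrates that idea; the disk radius and part naming are assumptions, not the paper's pose-based priors.

```python
import numpy as np

def keypoints_to_part_masks(keypoints, image_shape, radius=12):
    """Turn sparse keypoints into coarse initial part masks (illustrative disk painting).

    keypoints: dict mapping part name -> (x, y) pixel coordinates
    Returns:   dict mapping part name -> binary mask of shape image_shape
    """
    h, w = image_shape
    ys, xs = np.mgrid[0:h, 0:w]
    masks = {}
    for part, (x, y) in keypoints.items():
        masks[part] = ((xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2).astype(np.uint8)
    return masks

masks = keypoints_to_part_masks({"head": (64, 30), "left_hand": (20, 90)}, (128, 128))
print({k: int(m.sum()) for k, m in masks.items()})
```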

Human-Centered Emotion Recognition in Animated GIFs

1 code implementation27 Apr 2019 Zhengyuan Yang, Yixuan Zhang, Jiebo Luo

The framework consists of a facial attention module and a hierarchical segment temporal module.

Emotion Recognition

Attentive Relational Networks for Mapping Images to Scene Graphs

no code implementations CVPR 2019 Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, Jiebo Luo

Scene graph generation refers to the task of automatically mapping an image into a semantic structural graph, which requires correctly labeling each extracted object and its interaction relationships.

Graph Generation Object +4

Action Recognition with Spatio-Temporal Visual Attention on Skeleton Image Sequences

no code implementations31 Jan 2018 Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo

The attention mechanism is important for skeleton based action recognition because there exist spatio-temporal key stages while the joint predictions can be inaccurate.

Action Recognition Skeleton Based Action Recognition +1

End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perception

1 code implementation20 Jan 2018 Zhengyuan Yang, Yixuan Zhang, Jerry Yu, Junjie Cai, Jiebo Luo

In this work, we propose a multi-task learning framework to predict the steering angle and speed control simultaneously in an end-to-end manner.

Autonomous Driving Multi-Task Learning +2
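
A multi-task setup of the kind described above typically uses a shared visual backbone with separate regression heads for steering angle and speed. The sketch below is a minimal illustration of that structure; the layer sizes and input resolution are assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class MultiTaskDrivingNet(nn.Module):
    """Shared CNN backbone with separate heads for steering angle and speed (illustrative)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.steering_head = nn.Sequential(nn.Linear(36, 64), nn.ReLU(), nn.Linear(64, 1))
        self.speed_head = nn.Sequential(nn.Linear(36, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, frames):
        feats = self.backbone(frames)
        return self.steering_head(feats), self.speed_head(feats)

net = MultiTaskDrivingNet()
steer, speed = net(torch.randn(2, 3, 66, 200))
# A joint loss would weight both regression targets, e.g. mse(steer, y1) + mse(speed, y2).
print(steer.shape, speed.shape)
```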
