Search Results for author: Zhengyuan Yang

Found 33 papers, 21 papers with code

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

1 code implementation 13 Apr 2023 Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal

In this paper, we propose LayoutBench, a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape.

Layout-to-Image Generation

Equivariant Similarity for Vision-Language Foundation Models

1 code implementation 25 Mar 2023 Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang

Unlike the existing image-text similarity objective which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires similarity to vary faithfully according to the semantic changes.

Retrieval Text Retrieval +1

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

1 code implementation 20 Mar 2023 Zhengyuan Yang, Linjie Li, JianFeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.

GRiT: A Generative Region-to-text Transformer for Object Understanding

1 code implementation 1 Dec 2022 Jialian Wu, JianFeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang

Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions.

Dense Captioning Descriptive +2

ReCo: Region-Controlled Text-to-Image Generation

no code implementations CVPR 2023 Zhengyuan Yang, JianFeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang

Human evaluation on PaintSkill shows that ReCo is +19.28% and +17.21% more accurate in generating images with correct object count and spatial relationship than the T2I model.

Text-to-Image Generation

PromptCap: Prompt-Guided Task-Aware Image Captioning

1 code implementation 15 Nov 2022 Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo

PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA).

Image Captioning Language Modelling +3

Prompting GPT-3 To Be Reliable

1 code implementation 17 Oct 2022 Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, JianFeng Wang, Jordan Boyd-Graber, Lijuan Wang

While reliability is a broad and vaguely defined term, we decompose reliability into four main facets that correspond to the existing framework of ML safety and are well-recognized to be important: generalizability, social biases, calibration, and factuality.

Fairness Language Modelling

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

1 code implementation 14 Jun 2022 Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang

For another, we devise Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at the intermediate layers.

Visual Grounding

GIT: A Generative Image-to-text Transformer for Vision and Language

2 code implementations 27 May 2022 JianFeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.

Image Classification Language Modelling +5

Cross-modal Contrastive Distillation for Instructional Activity Anticipation

no code implementations 18 Jan 2022 Zhengyuan Yang, Jingen Liu, Jing Huang, Xiaodong He, Tao Mei, Chenliang Xu, Jiebo Luo

In this study, we aim to predict the plausible future action steps given an observation of the past and study the task of instructional activity anticipation.

Knowledge Distillation

Scaling Up Vision-Language Pre-training for Image Captioning

no code implementations CVPR 2022 Xiaowei Hu, Zhe Gan, JianFeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang

In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning.

Ranked #3 on Image Captioning on nocaps-XD entire (using extra training data)

Image Captioning

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling

1 code implementation 23 Nov 2021 Zhengyuan Yang, Zhe Gan, JianFeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang

On grounded captioning, UniTAB presents a simpler solution with a single output head, and significantly outperforms state of the art in both grounding and captioning evaluations.

Image Captioning Language Modelling +5

UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

no code implementations 19 Nov 2021 JianFeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, Lijuan Wang

In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning.

Image Captioning Language Modelling +8

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

1 code implementation 10 Sep 2021 Zhengyuan Yang, Zhe Gan, JianFeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang

To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA.

Image Captioning Question Answering +2

SAT: 2D Semantics Assisted Training for 3D Visual Grounding

1 code implementation ICCV 2021 Zhengyuan Yang, Songyang Zhang, LiWei Wang, Jiebo Luo

3D visual grounding aims at grounding a natural language description about a 3D scene, usually represented in the form of 3D point clouds, to the targeted object region.

Representation Learning Visual Grounding

TransVG: End-to-End Visual Grounding with Transformers

2 code implementations ICCV 2021 Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region of an image.

Referring Expression Comprehension Visual Grounding

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

1 code implementation CVPR 2021 Zhengyuan Yang, Yijuan Lu, JianFeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo

Due to this aligned representation learning, even pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4%, compared with a non-TAP baseline.

Language Modelling Masked Language Modeling +4

Pose-based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

no code implementations 30 Oct 2020 Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo

We then evaluate the framework on a proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms.

Action Recognition Emotion Recognition

Dynamic Context-guided Capsule Network for Multimodal Machine Translation

1 code implementation 4 Sep 2020 Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie Zhou, Jiebo Luo

Particularly, we represent the input image with global and regional visual features, and we introduce two parallel DCCNs to model multimodal context vectors with visual features at different granularities.

Multimodal Machine Translation Representation Learning +1

Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation

1 code implementation CVPR 2021 Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, Dong Yu

Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed.

Contrastive Learning Knowledge Distillation +4

Grounding-Tracking-Integration

no code implementations 13 Dec 2019 Zhengyuan Yang, Tushar Kumar, Tianlang Chen, Jinsong Su, Jiebo Luo

In this paper, we study Tracking by Language that localizes the target box sequence in a video based on a language query.

Weakly Supervised Body Part Segmentation with Pose based Part Priors

no code implementations 30 Jul 2019 Zhengyuan Yang, Yuncheng Li, Linjie Yang, Ning Zhang, Jiebo Luo

The core idea is to first convert sparse weak labels, such as keypoints, into an initial estimate of body part masks, and then iteratively refine the part mask predictions.

Face Parsing Semantic Segmentation

Human-Centered Emotion Recognition in Animated GIFs

1 code implementation 27 Apr 2019 Zhengyuan Yang, Yixuan Zhang, Jiebo Luo

The framework consists of a facial attention module and a hierarchical segment temporal module.

Emotion Recognition

Attentive Relational Networks for Mapping Images to Scene Graphs

no code implementations CVPR 2019 Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, Jiebo Luo

Scene graph generation refers to the task of automatically mapping an image into a semantic structural graph, which requires correctly labeling each extracted object and their interaction relationships.

Graph Generation Object Detection +2

Action Recognition with Spatio-Temporal Visual Attention on Skeleton Image Sequences

no code implementations 31 Jan 2018 Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo

The attention mechanism is important for skeleton based action recognition because there exist spatio-temporal key stages while the joint predictions can be inaccurate.

Action Recognition Skeleton Based Action Recognition +1

End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perception

1 code implementation 20 Jan 2018 Zhengyuan Yang, Yixuan Zhang, Jerry Yu, Junjie Cai, Jiebo Luo

In this work, we propose a multi-task learning framework to predict the steering angle and speed control simultaneously in an end-to-end manner.

Autonomous Driving Multi-Task Learning +2
