1 code implementation • 7 Jun 2023 • Jielin Qiu, Jiacheng Zhu, William Han, Aditesh Kumar, Karthik Mittal, Claire Jin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Bo Li, Ding Zhao, Lijuan Wang
To address these challenges and provide a comprehensive dataset for this new direction, we have meticulously curated the MultiSum dataset.
1 code implementation • 13 Apr 2023 • Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal
In this paper, we propose LayoutBench, a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape.
Ranked #1 on Layout-to-Image Generation on LayoutBench
1 code implementation • 25 Mar 2023 • Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang
Unlike the existing image-text similarity objective which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires similarity to vary faithfully according to the semantic changes.
no code implementations • 22 Mar 2023 • Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Gong Ming, Lijuan Wang, Zicheng Liu, Houqiang Li, Nan Duan
In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion architecture for eXtremely Long video generation.
1 code implementation • 20 Mar 2023 • Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang
We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
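As a rough illustration of how such a paradigm can be wired together, the sketch below runs a dispatch loop in which an LLM either requests a vision expert or emits a final answer; the `<tool>` tag format, the expert registry, and the `call_llm` stub are illustrative assumptions, not MM-REACT's actual protocol.

```python
# Illustrative sketch of an MM-REACT-style loop: an LLM emits tool requests,
# vision experts answer, and their outputs are appended to the conversation.
# The tag format and expert registry here are assumptions for illustration.
import re

def image_captioner(image_path: str) -> str:          # stand-in vision expert
    return "a dog playing with a frisbee in a park"

def ocr_reader(image_path: str) -> str:               # stand-in vision expert
    return "no text detected"

EXPERTS = {"caption": image_captioner, "ocr": ocr_reader}

def call_llm(history: str) -> str:
    # Placeholder for a real chat-model call; returns either a tool request
    # like "<tool>caption</tool>" or a final answer.
    if "<observation>" not in history:
        return "<tool>caption</tool>"
    return "Final answer: the image shows a dog playing with a frisbee."

def mm_react(question: str, image_path: str, max_steps: int = 5) -> str:
    history = f"User question about {image_path}: {question}\n"
    for _ in range(max_steps):
        reply = call_llm(history)
        match = re.search(r"<tool>(\w+)</tool>", reply)
        if match is None:
            return reply                               # LLM produced an answer
        tool = EXPERTS[match.group(1)]
        history += f"<observation>{tool(image_path)}</observation>\n"
    return "No answer within budget."

print(mm_react("What is the animal doing?", "photo.jpg"))
```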
no code implementations • 20 Mar 2023 • Changsheng Lv, Mengshi Qi, Xia Li, Zhengyuan Yang, Huadong Ma
In this paper, we propose the semantic graph Transformer (SGT) for 3D scene graph generation.
no code implementations • 21 Feb 2023 • Xiaodong Wang, Chenfei Wu, Shengming Yin, Minheng Ni, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Fan Yang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan
3D photography renders a static image into a video with appealing 3D visual effects.
Ranked #1 on Image Outpainting on MSCOCO
1 code implementation • 1 Dec 2022 • Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang
Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions.
Ranked #1 on Dense Captioning on Visual Genome
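A toy sketch of the three-component pipeline described above (visual encoder, foreground object extractor, text decoder); all module sizes, the top-k region selection, and the decoding setup are illustrative assumptions rather than the paper's configuration.

```python
# Toy sketch of GRiT's three components; sizes and selection heuristics are
# illustrative assumptions, not the paper's config.
import torch
import torch.nn as nn

class GRiTSketch(nn.Module):
    def __init__(self, vocab=1000, d=128, max_regions=4):
        super().__init__()
        self.encoder = nn.Sequential(               # visual encoder
            nn.Conv2d(3, d, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)))
        self.box_head = nn.Linear(d, 4)             # foreground object extractor
        self.obj_score = nn.Linear(d, 1)
        layer = nn.TransformerDecoderLayer(d, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.embed = nn.Embedding(vocab, d)
        self.lm_head = nn.Linear(d, vocab)          # text decoder head
        self.max_regions = max_regions

    def forward(self, image, tokens):
        # image: (B,3,H,W); tokens: (B,R,T) partial descriptions per region.
        # (A causal mask would be needed for real training; omitted here.)
        feats = self.encoder(image).flatten(2).transpose(1, 2)   # (B,64,d)
        scores = self.obj_score(feats).squeeze(-1)               # (B,64)
        topk = scores.topk(self.max_regions, dim=1).indices      # (B,R)
        region = torch.gather(feats, 1,
                              topk.unsqueeze(-1).expand(-1, -1, feats.size(-1)))
        boxes = self.box_head(region).sigmoid()                  # (B,R,4)
        B, R, T = tokens.shape
        tok = self.embed(tokens.view(B * R, T))                  # (BR,T,d)
        mem = region.view(B * R, 1, -1)                          # one region each
        out = self.decoder(tok, mem)                             # (BR,T,d)
        return boxes, self.lm_head(out).view(B, R, T, -1)

m = GRiTSketch()
boxes, logits = m(torch.randn(2, 3, 128, 128), torch.randint(0, 1000, (2, 4, 6)))
print(boxes.shape, logits.shape)   # (2,4,4) (2,4,6,1000)
```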
no code implementations • CVPR 2023 • Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang
Human evaluation on PaintSkill shows that ReCo is +19.28% and +17.21% more accurate in generating images with correct object count and spatial relationship than the T2I model.
1 code implementation • 15 Nov 2022 • Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo
PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA).
Ranked #2 on Visual Question Answering (VQA) on A-OKVQA
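The idea lends itself to a very small sketch: generate a caption conditioned on the question, then hand caption plus question to an LLM. Both model calls below are stubs, and the prompt wording is an assumption, not the paper's exact template.

```python
# Sketch of the PromptCap idea: a question-aware caption bridges the image
# and a text-only LLM. Both models are stubbed out for illustration.
def question_aware_caption(image_path: str, question: str) -> str:
    # A real system would run a captioner trained to cover question-relevant
    # content; this stub just returns a fixed caption.
    return "a man in a red 1970s-era jacket standing next to a vintage car"

def answer_with_llm(caption: str, question: str) -> str:
    prompt = (f"Context: {caption}\n"
              f"Question: {question}\n"
              "Answer the question using the context and common knowledge.")
    return f"[LLM answer for prompt: {prompt!r}]"     # placeholder LLM call

q = "Roughly what decade is this scene from?"
print(answer_with_llm(question_aware_caption("street.jpg", q), q))
```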
1 code implementation • 17 Oct 2022 • Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, Lijuan Wang
While reliability is a broad and vaguely defined term, we decompose it into four main facets that correspond to the existing framework of ML safety and are well recognized as important: generalizability, social biases, calibration, and factuality.
1 code implementation • 14 Jun 2022 • Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang
We further devise a Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at its intermediate layers.
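A minimal sketch of this fusion-inside-the-ViT idea: from a chosen intermediate layer onward, language tokens simply join the visual token sequence, so the same encoder blocks perform the fusion. The fusion layer index and all dimensions are illustrative assumptions.

```python
# Sketch of language-conditioned fusion inside a ViT: the shared uni-modal
# blocks process a joint sequence once language tokens are injected.
import torch
import torch.nn as nn

class LanguageConditionedViT(nn.Module):
    def __init__(self, d=128, depth=6, fuse_at=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
            for _ in range(depth))
        self.fuse_at = fuse_at

    def forward(self, vis_tokens, lang_tokens):
        x = vis_tokens                                   # (B, Nv, d)
        for i, blk in enumerate(self.blocks):
            if i == self.fuse_at:                        # inject language here
                x = torch.cat([x, lang_tokens], dim=1)   # (B, Nv+Nl, d)
            x = blk(x)                                   # shared uni-modal block
        return x

m = LanguageConditionedViT()
out = m(torch.randn(2, 49, 128), torch.randn(2, 8, 128))
print(out.shape)   # (2, 57, 128)
```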
2 code implementations • 27 May 2022 • Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Ranked #1 on Image Captioning on nocaps-XD out-of-domain
no code implementations • 18 Jan 2022 • Zhengyuan Yang, Jingen Liu, Jing Huang, Xiaodong He, Tao Mei, Chenliang Xu, Jiebo Luo
In this study, we aim to predict the plausible future action steps given an observation of the past and study the task of instructional activity anticipation.
no code implementations • CVPR 2022 • Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang
In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning.
Ranked #3 on Image Captioning on nocaps-XD entire (using extra training data)
1 code implementation • 23 Nov 2021 • Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang
On grounded captioning, UniTAB presents a simpler solution with a single output head, and significantly outperforms the state of the art in both grounding and captioning evaluations.
no code implementations • 19 Nov 2021 • Jianfeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, Lijuan Wang
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning.
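The routing idea reduces to a few lines: the same encoder weights process an image-only, text-only, or concatenated image-plus-text token sequence. Dimensions below are illustrative assumptions.

```python
# Sketch of the single-transformer idea behind UFO: one shared encoder for
# unimodal and multimodal token sequences. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class UnifiedTransformerSketch(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image_tokens=None, text_tokens=None):
        parts = [t for t in (image_tokens, text_tokens) if t is not None]
        assert parts, "need at least one modality"
        x = torch.cat(parts, dim=1)          # unimodal or multimodal sequence
        return self.encoder(x)               # one shared set of weights

m = UnifiedTransformerSketch()
img, txt = torch.randn(2, 49, 128), torch.randn(2, 12, 128)
print(m(image_tokens=img).shape)                   # image-only: (2, 49, 128)
print(m(text_tokens=txt).shape)                    # text-only:  (2, 12, 128)
print(m(image_tokens=img, text_tokens=txt).shape)  # joint:      (2, 61, 128)
```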
1 code implementation • 10 Sep 2021 • Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA.
Ranked #10 on Visual Question Answering (VQA) on OK-VQA
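A sketch of PICa-style prompt construction, where each in-context example is a caption, question, and answer, and the test image appears only through its caption; the exact template below is an illustrative assumption.

```python
# Sketch of PICa-style prompting: captions stand in for images so GPT-3 can
# answer knowledge-based VQA questions. Template wording is assumed.
EXAMPLES = [
    ("a red double-decker bus on a london street",
     "What country is this?", "england"),
    ("a plate of sushi with chopsticks",
     "What cuisine is shown?", "japanese"),
]

def build_pica_prompt(test_caption: str, test_question: str) -> str:
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {c}\nQuestion: {q}\nAnswer: {a}\n\n"
        for c, q, a in EXAMPLES)
    return (header + shots +
            f"Context: {test_caption}\nQuestion: {test_question}\nAnswer:")

prompt = build_pica_prompt(
    "a man holding an umbrella in heavy rain", "What season might this be?")
print(prompt)   # this string would be sent to GPT-3 for completion
```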
1 code implementation • ICCV 2021 • Zhengyuan Yang, Songyang Zhang, Liwei Wang, Jiebo Luo
3D visual grounding aims to ground a natural language description of a 3D scene, usually represented as 3D point clouds, to the targeted object region.
2 code implementations • ICCV 2021 • Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li
In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query onto the corresponding region of an image.
Ranked #13 on Referring Expression Comprehension on RefCOCO
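A minimal sketch of a TransVG-style head: visual tokens, language tokens, and a learnable [REG] token share one transformer, and the [REG] output regresses the box. Depths and sizes are illustrative assumptions.

```python
# Sketch of a TransVG-style grounding head: a [REG] token attends to visual
# and language tokens and directly regresses box coordinates.
import torch
import torch.nn as nn

class TransVGSketch(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d))
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.box_mlp = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))

    def forward(self, vis_tokens, lang_tokens):
        B = vis_tokens.size(0)
        reg = self.reg_token.expand(B, -1, -1)
        x = torch.cat([reg, vis_tokens, lang_tokens], dim=1)
        x = self.fusion(x)
        return self.box_mlp(x[:, 0]).sigmoid()   # normalized (cx, cy, w, h)

m = TransVGSketch()
print(m(torch.randn(2, 49, 128), torch.randn(2, 10, 128)).shape)  # (2, 4)
```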
1 code implementation • CVPR 2021 • Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo
Due to this aligned representation learning, even when pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4%, compared with a non-TAP baseline.
no code implementations • 30 Oct 2020 • Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo
We then evaluate the framework on a proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms.
1 code implementation • 4 Sep 2020 • Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie Zhou, Jiebo Luo
In particular, we represent the input image with global and regional visual features and introduce two parallel DCCNs to model multimodal context vectors with visual features at different granularities.
Ranked #3 on Multimodal Machine Translation on Multi30K
1 code implementation • ECCV 2020 • Zhengyuan Yang, Tianlang Chen, Liwei Wang, Jiebo Luo
We improve one-stage visual grounding by addressing current limitations on grounding long and complex queries.
1 code implementation • ACL 2020 • Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang, Jie Zhou, Jiebo Luo
Multi-modal neural machine translation (NMT) aims to translate source sentences, paired with images, into a target language.
1 code implementation • CVPR 2021 • Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, Dong Yu
Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed.
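One simple way to build an image-sentence score from a region-phrase score function is sketched below, with a max-over-regions, mean-over-phrases aggregation; this particular aggregation is an assumption for illustration, not necessarily the paper's choice.

```python
# Small sketch of composing an image-sentence score from region-phrase
# scores: each phrase takes its best-matching region, and phrase scores are
# averaged. The max/mean aggregation is one common choice, assumed here.
import torch

def image_sentence_score(region_feats, phrase_feats):
    # region_feats: (R, d); phrase_feats: (P, d), both L2-normalized
    scores = phrase_feats @ region_feats.T        # (P, R) region-phrase scores
    best_per_phrase = scores.max(dim=1).values    # best region for each phrase
    return best_per_phrase.mean()                 # image-sentence score

regions = torch.nn.functional.normalize(torch.randn(36, 64), dim=1)
phrases = torch.nn.functional.normalize(torch.randn(5, 64), dim=1)
print(image_sentence_score(regions, phrases))
```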
no code implementations • 13 Dec 2019 • Zhengyuan Yang, Tushar Kumar, Tianlang Chen, Jinsong Su, Jiebo Luo
In this paper, we study Tracking by Language, which localizes the target box sequence in a video based on a language query.
2 code implementations • ICCV 2019 • Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, Jiebo Luo
We propose a simple, fast, and accurate one-stage approach to visual grounding.
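A minimal sketch of the one-stage pattern: a sentence embedding is broadcast onto every cell of the visual feature map, and a small convolutional head predicts a box and confidence per cell, YOLO-style. All sizes are illustrative assumptions.

```python
# Sketch of a one-stage grounding head: language features are fused at every
# spatial location and boxes are predicted densely in a single pass.
import torch
import torch.nn as nn

class OneStageGroundingSketch(nn.Module):
    def __init__(self, d_vis=128, d_lang=64):
        super().__init__()
        self.head = nn.Conv2d(d_vis + d_lang, 5, kernel_size=1)  # 4 box + 1 conf

    def forward(self, vis_map, lang_vec):
        # vis_map: (B, d_vis, H, W); lang_vec: (B, d_lang)
        B, _, H, W = vis_map.shape
        lang = lang_vec[:, :, None, None].expand(-1, -1, H, W)
        fused = torch.cat([vis_map, lang], dim=1)   # language at every location
        return self.head(fused)                     # (B, 5, H, W) predictions

m = OneStageGroundingSketch()
print(m(torch.randn(2, 128, 16, 16), torch.randn(2, 64)).shape)  # (2, 5, 16, 16)
```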
no code implementations • 30 Jul 2019 • Zhengyuan Yang, Yuncheng Li, Linjie Yang, Ning Zhang, Jiebo Luo
The core idea is to first convert sparse weak labels such as keypoints into an initial estimate of body part masks, and then iteratively refine the part mask predictions.
1 code implementation • 27 Apr 2019 • Zhengyuan Yang, Yixuan Zhang, Jiebo Luo
The framework consists of a facial attention module and a hierarchical segment temporal module.
no code implementations • CVPR 2019 • Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, Jiebo Luo
Scene graph generation refers to the task of automatically mapping an image into a semantic structural graph, which requires correctly labeling each extracted object and the interaction relationships between them.
no code implementations • 31 Jan 2018 • Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo
The attention mechanism is important for skeleton-based action recognition because actions contain spatio-temporal key stages, while the joint predictions can be inaccurate.
1 code implementation • 20 Jan 2018 • Zhengyuan Yang, Yixuan Zhang, Jerry Yu, Junjie Cai, Jiebo Luo
In this work, we propose a multi-task learning framework to predict the steering angle and speed control simultaneously in an end-to-end manner.
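A minimal sketch of such a multi-task setup: one shared CNN trunk with two regression heads for steering angle and speed, trained with a weighted sum of losses. The architecture sizes and the loss weight are illustrative assumptions.

```python
# Sketch of a multi-task end-to-end driving model: shared features, separate
# heads for steering angle and speed, combined regression loss.
import torch
import torch.nn as nn

class SteeringSpeedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(                 # shared feature extractor
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.steer_head = nn.Linear(32, 1)          # steering angle (radians)
        self.speed_head = nn.Linear(32, 1)          # speed (m/s)

    def forward(self, frames):
        h = self.trunk(frames)
        return self.steer_head(h), self.speed_head(h)

model = SteeringSpeedNet()
frames = torch.randn(4, 3, 66, 200)                # batch of dashcam crops
steer, speed = model(frames)
target_steer, target_speed = torch.zeros(4, 1), torch.full((4, 1), 8.0)
loss = nn.functional.mse_loss(steer, target_steer) \
     + 0.5 * nn.functional.mse_loss(speed, target_speed)   # weighted multi-task
loss.backward()
print(steer.shape, speed.shape, float(loss))
```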