1 code implementation • 5 Nov 2024 • Qin Liu, JianFeng Wang, Zhengyuan Yang, Linjie Li, Kevin Lin, Marc Niethammer, Lijuan Wang
Semi-supervised video object segmentation (VOS) has been largely driven by space-time memory (STM) networks, which store past frame features in a spatiotemporal memory to segment the current frame via softmax attention.
Tasks: Semantic Segmentation, Semi-Supervised Video Object Segmentation, +1
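To make the memory mechanism concrete, here is a minimal PyTorch sketch of an STM-style memory readout, assuming past-frame feature maps are flattened to key/value tokens; it illustrates the softmax attention described above, not the paper's exact implementation.

```python
# Minimal sketch of space-time memory readout via softmax attention (assumes
# flattened token maps; not the paper's exact code).
import torch
import torch.nn.functional as F

def memory_readout(query_key, memory_key, memory_value):
    """query_key: (B, N_q, C); memory_key: (B, N_m, C); memory_value: (B, N_m, D)."""
    # Affinity between every query token and every stored memory token.
    affinity = torch.bmm(query_key, memory_key.transpose(1, 2))      # (B, N_q, N_m)
    # Softmax over the memory dimension turns affinities into attention weights.
    weights = F.softmax(affinity / query_key.shape[-1] ** 0.5, dim=-1)
    # Weighted sum of memory values is the readout used to segment the current frame.
    return torch.bmm(weights, memory_value)                          # (B, N_q, D)
```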
no code implementations • 4 Nov 2024 • Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, JianFeng Wang, Gim Hee Lee, Lijuan Wang
Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos.
no code implementations • 30 Oct 2024 • Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Linjie Li, Kevin Lin, Chung-Ching Lin, JianFeng Wang, Zhengyuan Yang, YingNian Wu, Lijuan Wang
Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module.
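As a rough illustration of the fast-learning component, the sketch below wraps a frozen linear projection with a low-rank (LoRA-style) update that can be tuned at inference time; the rank and scaling choices are assumptions, not the paper's exact module.

```python
# Hedged sketch of a temporal LoRA adapter: the slow-learned base weights stay
# frozen, and only the low-rank update is trained at inference time.
import torch
import torch.nn as nn

class TemporalLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # freeze slow-learned weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # LoRA starts as a zero update
        self.scale = alpha / rank

    def forward(self, x):                        # x: (batch * space, time, dim)
        return self.base(x) + self.scale * self.up(self.down(x))
```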
no code implementations • 13 Oct 2024 • Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo
With MMCOMPOSITION, we can quantify and explore the compositionality of mainstream VLMs.
no code implementations • 4 Oct 2024 • Zichen Miao, Zhengyuan Yang, Kevin Lin, Ze Wang, Zicheng Liu, Lijuan Wang, Qiang Qiu
We show that PSO can directly adapt distilled models to human-preferred generation with both offline and online-generated pairwise preference image data.
no code implementations • 3 Oct 2024 • Kaizhi Zheng, Xiaotong Chen, Xuehai He, Jing Gu, Linjie Li, Zhengyuan Yang, Kevin Lin, JianFeng Wang, Lijuan Wang, Xin Eric Wang
Given the steep learning curve of professional 3D software and the time-consuming process of managing large 3D assets, language-guided 3D scene editing has significant potential in fields such as virtual reality, augmented reality, and gaming.
no code implementations • 21 Aug 2024 • Minheng Ni, Chenfei Wu, Huaying Yuan, Zhengyuan Yang, Ming Gong, Lijuan Wang, Zicheng Liu, WangMeng Zuo, Nan Duan
With the advancement of generative models, the synthesis of different sensory elements such as music, visuals, and speech has achieved significant realism.
1 code implementation • 15 Jul 2024 • Yuanhao Zhai, Kevin Lin, Linjie Li, Chung-Ching Lin, JianFeng Wang, Zhengyuan Yang, David Doermann, Junsong Yuan, Zicheng Liu, Lijuan Wang
First, to enable dual-modal generation and maximize the information exchange between video and depth generation, we propose a unified dual-modal U-Net, a parameter-sharing framework for joint video and depth denoising, wherein a modality label guides the denoising target, and cross-modal attention enables the mutual information flow.
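A minimal sketch of the parameter-sharing idea, assuming a generic U-Net backbone and a learned modality embedding; the cross-modal attention wiring is omitted for brevity, and `unet` is a stand-in rather than the actual architecture.

```python
# Sketch: one shared denoiser serves video and depth; a modality embedding
# added to the timestep embedding tells the shared weights which target to
# denoise. Illustrative assumption, not the paper's exact design.
import torch
import torch.nn as nn

class DualModalDenoiser(nn.Module):
    def __init__(self, unet: nn.Module, emb_dim: int):
        super().__init__()
        self.unet = unet                              # shared for both modalities
        self.modality_emb = nn.Embedding(2, emb_dim)  # 0 = video, 1 = depth

    def forward(self, noisy, t_emb, modality: int):
        label = torch.full((noisy.shape[0],), modality,
                           dtype=torch.long, device=noisy.device)
        cond = t_emb + self.modality_emb(label)       # modality label guides target
        return self.unet(noisy, cond)
```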
no code implementations • 14 Jun 2024 • Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks.
1 code implementation • 12 Jun 2024 • Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, JianFeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang
Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics.
1 code implementation • 11 Jun 2024 • Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, JianFeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, Lijuan Wang
Extensive experiments show that our MCM achieves state-of-the-art video diffusion distillation performance.
1 code implementation • 25 Apr 2024 • An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, JianFeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang
Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V, by enabling the model to associate visual objects with tags inserted on the image.
Ranked #104 on Visual Question Answering on MM-Vet
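For intuition, a small PIL sketch of SoM-style mark drawing follows; the box inputs are hypothetical and would come from any detector or segmenter.

```python
# Illustrative Set-of-Mark overlay: draw a numeric tag at each object box so a
# multimodal model can refer to objects by number. Boxes are assumed inputs.
from PIL import Image, ImageDraw

def draw_marks(image: Image.Image, boxes):
    """boxes: list of (x0, y0, x1, y1) in pixel coordinates."""
    canvas = image.copy()
    draw = ImageDraw.Draw(canvas)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.rectangle((x0, y0, x0 + 22, y0 + 18), fill="red")  # tag background
        draw.text((x0 + 4, y0 + 2), str(idx), fill="white")     # the mark itself
    return canvas
```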
no code implementations • 19 Mar 2024 • JieLin Qiu, William Han, Winfred Wang, Zhengyuan Yang, Linjie Li, JianFeng Wang, Christos Faloutsos, Lei Li, Lijuan Wang
Open-domain real-world entity recognition is essential yet challenging, involving identifying various entities in diverse environments.
no code implementations • 5 Mar 2024 • Chenglei Si, Yanzhe Zhang, Zhengyuan Yang, Ruibo Liu, Diyi Yang
In this work, we formalize this as a Design2Code task and conduct comprehensive benchmarking.
no code implementations • 30 Jan 2024 • Zecheng Tang, Chenfei Wu, Zekai Zhang, Minheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, Nan Duan
To leverage LLMs for visual synthesis, traditional methods convert raster image information into discrete grid tokens through specialized visual modules, thereby disrupting the model's ability to capture the true semantic representation of visual scenes.
no code implementations • 4 Jan 2024 • Jie An, Zhengyuan Yang, JianFeng Wang, Linjie Li, Zicheng Liu, Lijuan Wang, Jiebo Luo
The first module, similar to a standard DDPM, learns to predict the added noise and is unaffected by the metric function.
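For reference, the standard DDPM noise-prediction training step that this first module follows; a generic sketch with an assumed `model(x_t, t)` signature, not the authors' code.

```python
# Standard DDPM objective: sample a timestep, mix signal and Gaussian noise
# with the closed-form schedule, and regress the added noise.
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alphas_cumprod):
    """x0: (B, C, H, W) clean images; alphas_cumprod: (T,) cumulative schedule."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise   # sample from q(x_t | x_0)
    return F.mse_loss(model(x_t, t), noise)        # predict the added noise
```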
no code implementations • 1 Jan 2024 • Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, JianFeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
Our unified framework merges unimodal and multimodal elements, enhancing model performance for tasks involving textual and visual data while notably reducing learnable parameters.
no code implementations • CVPR 2024 • Zichen Miao, Jiang Wang, Ze Wang, Zhengyuan Yang, Lijuan Wang, Qiang Qiu, Zicheng Liu
We also show the effectiveness of our RL fine-tuning framework on enhancing the diversity of image generation with different types of diffusion models, including class-conditional models and text-conditional models, e.g., Stable Diffusion.
no code implementations • 21 Dec 2023 • Bingbing Wen, Zhengyuan Yang, JianFeng Wang, Zhe Gan, Bill Howe, Lijuan Wang
In this paper, we build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round even with external knowledge related to the visual content.
1 code implementation • 12 Dec 2023 • Xueyan Zou, Linjie Li, JianFeng Wang, Jianwei Yang, Mingyu Ding, Junyi Wei, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang
To further unleash the power of foundation models, we present FIND, a generalized interface for aligning foundation models' embeddings with unified image and dataset-level understanding spanning modality and granularity.
no code implementations • CVPR 2024 • Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, JianFeng Wang, Linjie Li, Chung-Ching Lin, Zicheng Liu, Lijuan Wang
We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD).
2 code implementations • 13 Nov 2023 • An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, JianFeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang
We first benchmark MM-Navigator on our collected iOS screen dataset.
1 code implementation • 13 Nov 2023 • Hanjia Lyu, Jinfa Huang, Daoan Zhang, Yongsheng Yu, Xinyi Mou, Jinsheng Pan, Zhengyuan Yang, Zhongyu Wei, Jiebo Luo
Our investigation begins with a preliminary quantitative analysis for each task using existing benchmark datasets, followed by a careful review of the results and a selection of qualitative samples that illustrate GPT-4V's potential in understanding multimodal social media content.
1 code implementation • 30 Oct 2023 • Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, JianFeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, Lijuan Wang
We present MM-VID, an integrated system that harnesses the capabilities of GPT-4V, combined with specialized tools in vision, audio, and speech, to facilitate advanced video understanding.
1 code implementation • 23 Oct 2023 • Kevin Lin, Zhengyuan Yang, Linjie Li, JianFeng Wang, Lijuan Wang
For DEsignBench benchmarking, we perform human evaluations on generated images in the DEsignBench gallery against the criteria of image-text alignment, visual aesthetics, and design creativity.
no code implementations • 12 Oct 2023 • Zhengyuan Yang, JianFeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Lijuan Wang
We introduce "Idea to Image," a system that enables multimodal iterative self-refinement with GPT-4V(ision) for automatic image design and generation.
no code implementations • 11 Oct 2023 • Jie An, Zhengyuan Yang, Linjie Li, JianFeng Wang, Kevin Lin, Zicheng Liu, Lijuan Wang, Jiebo Luo
We hope our proposed framework, benchmark, and LMM evaluation could help establish the intriguing interleaved image-text generation task.
1 code implementation • 29 Sep 2023 • Zhengyuan Yang, Linjie Li, Kevin Lin, JianFeng Wang, Chung-Ching Lin, Zicheng Liu, Lijuan Wang
We hope that this preliminary exploration will inspire future research on next-generation multimodal task formulation, new ways to exploit and enhance LMMs to solve real-world problems, and a better understanding of multimodal foundation models.
Ranked #3 on MMR total on MRR-Benchmark (using extra training data)
1 code implementation • 18 Sep 2023 • Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants.
1 code implementation • 4 Aug 2023 • Weihao Yu, Zhengyuan Yang, Linjie Li, JianFeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang
Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking.
no code implementations • 27 Jul 2023 • Xin Yuan, Linjie Li, JianFeng Wang, Zhengyuan Yang, Kevin Lin, Zicheng Liu, Lijuan Wang
In this paper, we study the denoising diffusion probabilistic model (DDPM) in wavelet space, instead of pixel space, for visual synthesis.
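A toy sketch, using PyWavelets, of moving between pixel space and wavelet space, where the diffusion process would operate on the stacked subbands; single-level Haar is an assumption made for brevity.

```python
# Round-trip between pixel space and wavelet space: diffusion would run on the
# stacked subbands instead of the raw pixels.
import numpy as np
import pywt

def to_wavelet(img: np.ndarray) -> np.ndarray:
    """img: (H, W) float array -> (4, H/2, W/2) stacked subbands."""
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")
    return np.stack([cA, cH, cV, cD])

def from_wavelet(bands: np.ndarray) -> np.ndarray:
    cA, cH, cV, cD = bands
    return pywt.idwt2((cA, (cH, cV, cD)), "haar")

img = np.random.rand(64, 64).astype(np.float32)
bands = to_wavelet(img)                 # a diffusion model would operate here
restored = from_wavelet(bands)          # back to pixel space
assert np.allclose(img, restored, atol=1e-5)   # transform is invertible
```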
1 code implementation • CVPR 2024 • Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang
In this paper, we depart from the traditional paradigm of human motion transfer and emphasize two additional critical attributes for the synthesis of human dance content in social media contexts: (i) Generalizability: the model should be able to generalize beyond generic human viewpoints as well as unseen human subjects, backgrounds, and poses; (ii) Compositionality: it should allow for the seamless composition of seen/unseen subjects, backgrounds, and poses from different sources.
1 code implementation • CVPR 2024 • JieLin Qiu, Jiacheng Zhu, William Han, Aditesh Kumar, Karthik Mittal, Claire Jin, Zhengyuan Yang, Linjie Li, JianFeng Wang, Ding Zhao, Bo Li, Lijuan Wang
To address these challenges and provide a comprehensive dataset for this new direction, we have meticulously curated the \textbf{MMSum} dataset.
2 code implementations • 13 Apr 2023 • Jaemin Cho, Linjie Li, Zhengyuan Yang, Zhe Gan, Lijuan Wang, Mohit Bansal
In this paper, we propose LayoutBench, a diagnostic benchmark for layout-guided image generation that examines four categories of spatial control skills: number, position, size, and shape.
1 code implementation • ICCV 2023 • Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang
Unlike the existing image-text similarity objective which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires similarity to vary faithfully according to the semantic changes.
Ranked #7 on Visual Reasoning on Winoground
no code implementations • 22 Mar 2023 • Shengming Yin, Chenfei Wu, Huan Yang, JianFeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Gong Ming, Lijuan Wang, Zicheng Liu, Houqiang Li, Nan Duan
In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion architecture for eXtremely Long video generation.
1 code implementation • 20 Mar 2023 • Changsheng Lv, Mengshi Qi, Xia Li, Zhengyuan Yang, Huadong Ma
In this paper, we propose a novel model called SGFormer, Semantic Graph TransFormer for point cloud-based 3D scene graph generation.
1 code implementation • 20 Mar 2023 • Zhengyuan Yang, Linjie Li, JianFeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang
We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
Ranked #62 on Visual Question Answering on MM-Vet
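Schematically, an MM-REACT-style loop can be sketched as below; `call_llm` and the expert functions are placeholders, and the ACTION/OBSERVATION protocol shown is a simplified assumption, not the system's actual prompt format.

```python
# Schematic reasoning-and-action loop: the LLM either answers directly or
# requests a vision expert; expert output is appended to the context and the
# LLM is queried again.
def mm_react(question, image, call_llm, experts, max_steps=5):
    context = f"Question: {question}\nImage: <attached>"
    for _ in range(max_steps):
        reply = call_llm(context)
        if reply.startswith("ACTION:"):          # e.g. "ACTION: ocr"
            tool = reply.split(":", 1)[1].strip()
            observation = experts[tool](image)   # invoke the vision expert
            context += f"\n{reply}\nOBSERVATION: {observation}"
        else:
            return reply                         # final multimodal answer
    return reply
```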
no code implementations • 21 Feb 2023 • Xiaodong Wang, Chenfei Wu, Shengming Yin, Minheng Ni, JianFeng Wang, Linjie Li, Zhengyuan Yang, Fan Yang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan
3D photography renders a static image into a video with appealing 3D visual effects.
Ranked #1 on Image Outpainting on MSCOCO
no code implementations • ICCV 2023 • Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo
PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA).
1 code implementation • 1 Dec 2022 • Jialian Wu, JianFeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang
Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions.
Ranked #2 on Dense Captioning on Visual Genome
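A structural sketch of the three GRiT components named above, with all submodules as stand-ins; it shows the data flow, not the actual implementation.

```python
# Structural stand-in: visual encoder -> foreground object extractor -> text
# decoder producing an open-set description per localized object.
import torch.nn as nn

class GRiTSketch(nn.Module):
    def __init__(self, encoder, extractor, text_decoder):
        super().__init__()
        self.encoder = encoder              # image -> feature map
        self.extractor = extractor          # features -> boxes + region features
        self.text_decoder = text_decoder    # region features -> token sequences

    def forward(self, images):
        feats = self.encoder(images)
        boxes, region_feats = self.extractor(feats)
        descriptions = self.text_decoder(region_feats)  # open-set descriptions
        return boxes, descriptions
```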
no code implementations • CVPR 2023 • Zhengyuan Yang, JianFeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang
Human evaluation on PaintSkill shows that ReCo is +19.28% and +17.21% more accurate in generating images with correct object count and spatial relationship than the T2I model.
Tasks: Conditional Text-to-Image Synthesis, Layout-to-Image Generation, +1
1 code implementation • 15 Nov 2022 • Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, Jiebo Luo
PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA).
Ranked #1 on Visual Question Answering on TextVQA test-standard
1 code implementation • 17 Oct 2022 • Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, JianFeng Wang, Jordan Boyd-Graber, Lijuan Wang
While reliability is a broad and vaguely defined term, we decompose reliability into four main facets that correspond to the existing framework of ML safety and are well-recognized to be important: generalizability, social biases, calibration, and factuality.
1 code implementation • 14 Jun 2022 • Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang
We further devise a Language Conditioned Vision Transformer that removes external fusion modules and reuses the uni-modal ViT for vision-language fusion at its intermediate layers.
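A hedged sketch of this intermediate-layer fusion idea: language tokens are injected into a stack of otherwise uni-modal ViT blocks, so the same self-attention performs the cross-modal fusion; the fusion layer index and the text projection are illustrative assumptions.

```python
# From a chosen intermediate layer onward, language tokens join the visual
# tokens and the shared ViT blocks fuse the two modalities.
import torch
import torch.nn as nn

class LanguageConditionedViT(nn.Module):
    def __init__(self, blocks: nn.ModuleList, fuse_from: int, text_proj: nn.Linear):
        super().__init__()
        self.blocks = blocks            # stand-in uni-modal ViT blocks
        self.fuse_from = fuse_from      # index of the first fusion layer
        self.text_proj = text_proj      # maps text features into the ViT width

    def forward(self, vis_tokens, text_feats):
        for i, blk in enumerate(self.blocks):
            if i == self.fuse_from:     # inject language at an intermediate layer
                vis_tokens = torch.cat(
                    [vis_tokens, self.text_proj(text_feats)], dim=1)
            vis_tokens = blk(vis_tokens)  # shared block now fuses modalities
        return vis_tokens
```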
1 code implementation • 27 May 2022 • JianFeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Ranked #1 on Image Captioning on nocaps-XD near-domain
no code implementations • 18 Jan 2022 • Zhengyuan Yang, Jingen Liu, Jing Huang, Xiaodong He, Tao Mei, Chenliang Xu, Jiebo Luo
In this study, we aim to predict the plausible future action steps given an observation of the past and study the task of instructional activity anticipation.
no code implementations • CVPR 2022 • Xiaowei Hu, Zhe Gan, JianFeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang
In this paper, we present LEMON, a LargE-scale iMage captiONer, and provide the first empirical study on the scaling behavior of VLP for image captioning.
Ranked #3 on Image Captioning on nocaps-XD entire (using extra training data)
1 code implementation • 23 Nov 2021 • Zhengyuan Yang, Zhe Gan, JianFeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang
On grounded captioning, UniTAB presents a simpler solution with a single output head, and significantly outperforms the state of the art in both grounding and captioning evaluations.
no code implementations • 19 Nov 2021 • JianFeng Wang, Xiaowei Hu, Zhe Gan, Zhengyuan Yang, Xiyang Dai, Zicheng Liu, Yumao Lu, Lijuan Wang
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning.
1 code implementation • 10 Sep 2021 • Zhengyuan Yang, Zhe Gan, JianFeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, Lijuan Wang
To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT3 via the use of Image Captions, for knowledge-based VQA.
Ranked #20 on Visual Question Answering (VQA) on OK-VQA (using extra training data)
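A minimal sketch of PICa-style prompt construction follows; the instruction wording and the example records are hypothetical, not the paper's exact prompt.

```python
# Build a GPT-3 prompt from an image caption plus a few in-context QA shots,
# so the LLM can answer a knowledge-based visual question as text completion.
def build_pica_prompt(caption, question, shots):
    header = "Please answer the question according to the context.\n\n"
    blocks = [
        f"Context: {s['caption']}\nQ: {s['question']}\nA: {s['answer']}"
        for s in shots                       # in-context examples
    ]
    blocks.append(f"Context: {caption}\nQ: {question}\nA:")
    return header + "\n\n".join(blocks)

prompt = build_pica_prompt(
    caption="A man riding a horse on a beach.",
    question="What animal is shown?",
    shots=[{"caption": "A red double-decker bus on a street.",
            "question": "What vehicle is this?", "answer": "bus"}],
)
```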
1 code implementation • ICCV 2021 • Zhengyuan Yang, Songyang Zhang, Liwei Wang, Jiebo Luo
3D visual grounding aims at grounding a natural language description of a 3D scene, usually represented as 3D point clouds, to the targeted object region.
2 code implementations • ICCV 2021 • Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li
In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query onto the corresponding region of an image.
Ranked #16 on Referring Expression Comprehension on RefCOCO
1 code implementation • CVPR 2021 • Zhengyuan Yang, Yijuan Lu, JianFeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo
Due to this aligned representation learning, even when pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4% compared with a non-TAP baseline.
no code implementations • 30 Oct 2020 • Zhengyuan Yang, Amanda Kay, Yuncheng Li, Wendi Cross, Jiebo Luo
We then evaluate the framework on a proposed URMC dataset, which consists of conversations between a standardized patient and a behavioral health professional, along with expert annotations of body language, emotions, and potential psychiatric symptoms.
1 code implementation • 4 Sep 2020 • Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie Zhou, Jiebo Luo
Particularly, we represent the input image with global and regional visual features, and introduce two parallel DCCNs to model multimodal context vectors with visual features at different granularities.
Ranked #3 on Multimodal Machine Translation on Multi30K
1 code implementation • ECCV 2020 • Zhengyuan Yang, Tianlang Chen, Liwei Wang, Jiebo Luo
We improve one-stage visual grounding by addressing current limitations on grounding long and complex queries.
1 code implementation • ACL 2020 • Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang, Jie Zhou, Jiebo Luo
Multi-modal neural machine translation (NMT) aims to translate source sentences paired with images into a target language.
1 code implementation • CVPR 2021 • Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, Dong Yu
Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed.
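One common way to realize this construction, sketched under our own simplifying assumptions: each phrase takes its best-matching region score, and the image-sentence score averages over phrases.

```python
# Compose an image-sentence score from region-phrase scores: max over regions
# per phrase, mean over phrases. A generic weakly-supervised grounding recipe.
import torch

def image_sentence_score(region_feats, phrase_feats):
    """region_feats: (R, D); phrase_feats: (P, D); both L2-normalized."""
    scores = phrase_feats @ region_feats.T       # (P, R) region-phrase scores
    best_per_phrase = scores.max(dim=1).values   # best region for each phrase
    return best_per_phrase.mean()                # image-sentence score
```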
no code implementations • 13 Dec 2019 • Zhengyuan Yang, Tushar Kumar, Tianlang Chen, Jinsong Su, Jiebo Luo
In this paper, we study Tracking by Language, which localizes the target box sequence in a video based on a language query.
2 code implementations • ICCV 2019 • Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, Jiebo Luo
We propose a simple, fast, and accurate one-stage approach to visual grounding.
no code implementations • 30 Jul 2019 • Zhengyuan Yang, Yuncheng Li, Linjie Yang, Ning Zhang, Jiebo Luo
The core idea is to first convert sparse weak labels such as keypoints into an initial estimate of body part masks, and then iteratively refine the part mask predictions.
1 code implementation • 27 Apr 2019 • Zhengyuan Yang, Yixuan Zhang, Jiebo Luo
The framework consists of a facial attention module and a hierarchical segment temporal module.
no code implementations • CVPR 2019 • Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, Jiebo Luo
Scene graph generation refers to the task of automatically mapping an image into a semantic structural graph, which requires correctly labeling all extracted objects and their interaction relationships.
no code implementations • 31 Jan 2018 • Zhengyuan Yang, Yuncheng Li, Jianchao Yang, Jiebo Luo
The attention mechanism is important for skeleton-based action recognition because actions contain spatio-temporal key stages, while the joint predictions can be inaccurate.
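As a generic illustration (not the paper's exact module), a temporal attention pooling layer that can emphasize key stages of a skeleton sequence:

```python
# Attention scores over time let the model emphasize key stages and
# down-weight frames with noisy joint estimates.
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                          # x: (B, T, dim) per-frame features
        w = torch.softmax(self.score(x), dim=1)    # attention over the time axis
        return (w * x).sum(dim=1)                  # attended clip-level feature
```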
1 code implementation • 20 Jan 2018 • Zhengyuan Yang, Yixuan Zhang, Jerry Yu, Junjie Cai, Jiebo Luo
In this work, we propose a multi-task learning framework to predict the steering angle and speed control simultaneously in an end-to-end manner.
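A minimal sketch of such a two-head network; the backbone is a stand-in, and the paper's exact architecture may differ.

```python
# One shared backbone, two regression heads: steering angle and speed are
# predicted jointly in a single forward pass.
import torch.nn as nn

class DrivingMultiTaskNet(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone                   # frames -> (B, feat_dim)
        self.steer_head = nn.Linear(feat_dim, 1)   # steering angle
        self.speed_head = nn.Linear(feat_dim, 1)   # speed command

    def forward(self, frames):
        feats = self.backbone(frames)
        return self.steer_head(feats), self.speed_head(feats)
```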