1 code implementation • 18 Sep 2023 • Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen
We find that scaling LMMs consistently enhances model performance and improves language capabilities, and that the performance of LoRA/QLoRA tuning of LMMs is comparable to that of full-model fine-tuning (see the sketch below).
Ranked #43 on Visual Question Answering on MM-Vet
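The LoRA result above is easy to reproduce in outline with HuggingFace's peft library. A minimal sketch, assuming a causal-LM-style checkpoint; the model path and target module names are placeholders, not the paper's actual configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint; the paper's models are LLaVA-style LMMs.
model = AutoModelForCausalLM.from_pretrained("path/to/lmm-base")

lora = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # which projections to adapt (assumption)
    lora_dropout=0.05,
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()         # only the small adapter matrices train
```

QLoRA follows the same pattern with the base weights loaded in 4-bit precision.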
5 code implementations • 9 Mar 2023 • Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang
To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion.
Ranked #1 on Zero-Shot Object Detection on MSCOCO
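A hedged sketch of the three fusion phases named above — feature enhancer, language-guided query selection, cross-modality decoder — follows; layer sizes, head counts, and query counts are illustrative, not Grounding DINO's actual settings:

```python
import torch
import torch.nn as nn

class TightFusion(nn.Module):
    """Toy version of the three-phase fusion; all sizes are illustrative."""

    def __init__(self, dim=256, heads=8, num_queries=100):
        super().__init__()
        # Phase 1: feature enhancer -- each modality attends to the other.
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Phase 3: cross-modality decoder over the selected queries.
        self.decoder = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.num_queries = num_queries

    def forward(self, img, txt):  # img: (B, N, D), txt: (B, T, D)
        img_enh, _ = self.txt2img(img, txt, txt)   # image enhanced by language
        txt_enh, _ = self.img2txt(txt, img, img)   # language enhanced by image
        # Phase 2: language-guided query selection -- keep the image tokens
        # most similar to any text token.
        sim = img_enh @ txt_enh.transpose(1, 2)    # (B, N, T)
        idx = sim.max(-1).values.topk(self.num_queries, dim=1).indices
        queries = torch.gather(
            img_enh, 1, idx.unsqueeze(-1).expand(-1, -1, img_enh.size(-1)))
        return self.decoder(queries, img_enh)      # decode queries against image memory
```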
2 code implementations • NeurIPS 2023 • Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, JianFeng Wang, Lijuan Wang, Jianfeng Gao, Yong Jae Lee
In SEEM, we propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks, aiming at a universal segmentation interface that behaves like large language models (LLMs).
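One way to read "diverse prompting for all types of segmentation tasks" is that every prompt type is first embedded into a joint space that a single decoder consumes. The sketch below illustrates that reading only; the module names and dimensions are assumptions, not SEEM's implementation:

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Embed points, boxes, and text tokens into one joint prompt space (toy)."""

    def __init__(self, dim=256, vocab=30522):
        super().__init__()
        self.point = nn.Linear(2, dim)        # click prompts: (x, y)
        self.box = nn.Linear(4, dim)          # box prompts: (x0, y0, x1, y1)
        self.text = nn.Embedding(vocab, dim)  # text prompts: token ids

    def forward(self, points=None, boxes=None, token_ids=None):
        parts = []
        if points is not None:
            parts.append(self.point(points))
        if boxes is not None:
            parts.append(self.box(boxes))
        if token_ids is not None:
            parts.append(self.text(token_ids))
        # One sequence of prompt tokens, regardless of which prompt types came in.
        return torch.cat(parts, dim=1)
```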
6 code implementations • 22 Mar 2022 • Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao
For semantic segmentation with UPerNet, FocalNet base at single-scale outperforms Swin by 2.4 and also beats Swin at multi-scale (50.5 mIoU).
Ranked #8 on Object Detection on COCO minival (using extra training data)
3 code implementations • 17 Oct 2023 • Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
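The core of SoM-style prompting is mechanical: overlay numbered marks on image regions so the LMM can refer to "the object at mark 3" instead of raw coordinates. A minimal sketch (not the authors' code) using PIL:

```python
import numpy as np
from PIL import Image, ImageDraw

def draw_marks(image: Image.Image, masks: list) -> Image.Image:
    """Overlay a numbered mark at the center of each binary region mask."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for i, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)                    # pixels of region i
        cx, cy = int(xs.mean()), int(ys.mean())
        draw.ellipse([cx - 12, cy - 12, cx + 12, cy + 12],
                     fill="white", outline="black")
        draw.text((cx - 4, cy - 7), str(i), fill="black")
    return out  # send this image to the LMM and ask about "the object at mark 2"
```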
2 code implementations • 1 Nov 2023 • Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, Chunyuan Li
LLaVA-Interactive is a research prototype for multimodal human-AI interaction.
2 code implementations • CVPR 2022 • Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao
The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich.
Ranked #1 on 2D Object Detection on RF100
8 code implementations • 19 Apr 2022 • Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, Jianfeng Gao
In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets and tasks.
Ranked #1 on Object Detection on ELEVATER
1 code implementation • 10 Jul 2023 • Feng Li, Hao Zhang, Peize Sun, Xueyan Zou, Shilong Liu, Jianwei Yang, Chunyuan Li, Lei Zhang, Jianfeng Gao
In this paper, we introduce Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity.
3 code implementations • 22 Nov 2023 • Feng Li, Qing Jiang, Hao Zhang, Tianhe Ren, Shilong Liu, Xueyan Zou, Huaizhe xu, Hongyang Li, Chunyuan Li, Jianwei Yang, Lei Zhang, Jianfeng Gao
In-context prompting in large language models (LLMs) has become a prevalent approach to improve zero-shot capabilities, but this idea is less explored in the vision domain.
1 code implementation • 12 Jun 2023 • Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, Jianan Wang, Zhaoyang Zeng, Xianbiao Qi, Yuhui Yuan, Jianwei Yang, Lei Zhang
To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports a majority of the mainstream DETR-based instance recognition algorithms, covering various fundamental tasks, including object detection, segmentation, and pose estimation.
1 code implementation • CVPR 2023 • Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee
Large-scale text-to-image diffusion models have made amazing advances.
Ranked #4 on Conditional Text-to-Image Synthesis on COCO-MIG
1 code implementation • CVPR 2023 • Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, JianFeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, Jianfeng Gao
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly.
Ranked #4 on Instance Segmentation on ADE20K val (using extra training data)
2 code implementations • ICCV 2023 • Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, Lei Zhang
We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets.
Ranked #2 on Instance Segmentation on ADE20K val (using extra training data)
3 code implementations • 1 Jul 2021 • Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers on a range of public image classification and object detection benchmarks.
Ranked #17 on Instance Segmentation on COCO test-dev
7 code implementations • CVPR 2021 • Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao
In our experiments, we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, Oscar (Li et al., 2020), and utilize an improved approach to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
Ranked #2 on Image-text matching on CommercialAdsDataset
1 code implementation • 18 Sep 2023 • Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants.
3 code implementations • ECCV 2018 • Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh
We propose a novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images.
Ranked #12 on Scene Graph Generation on Visual Genome
2 code implementations • 25 Apr 2023 • Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, Lei Zhang
This work presents Focal-Stable-DINO, a strong and reproducible object detection model which achieves 64.6 AP on COCO val2017 and 64.8 AP on COCO test-dev using only 700M parameters without any test-time augmentation.
Ranked #5 on Object Detection on COCO minival (using extra training data)
1 code implementation • CVPR 2022 • Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao
However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans.
Ranked #10 on Open Vocabulary Object Detection on MSCOCO (using extra training data)
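The whole-image-vs-region mismatch described above is easy to probe with the public openai/CLIP package: CLIP scores a cropped region exactly the way it scores a full image, with no region-level training. A small sketch; the file name and box are illustrative:

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

model, preprocess = clip.load("ViT-B/32")
text = clip.tokenize(["a photo of a dog", "a photo of a cat"])

image = Image.open("scene.jpg")                # illustrative file
region = image.crop((50, 30, 220, 200))        # a region proposal (x0, y0, x1, y1)
region = preprocess(region).unsqueeze(0)       # CLIP treats the crop as a whole image

with torch.no_grad():
    logits_per_image, _ = model(region, text)
    probs = logits_per_image.softmax(dim=-1)   # region-level label distribution
```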
1 code implementation • 9 Nov 2023 • Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, Chunyuan Li
LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models.
Ranked #1 on LMM real-life tasks on Leaderboard
1 code implementation • NeurIPS 2021 • Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao
With focal attention, we propose a new variant of Vision Transformer models, called Focal Transformers, which achieve superior performance over the state-of-the-art (SoTA) Vision Transformers on a range of public image classification and object detection benchmarks.
1 code implementation • CVPR 2018 • Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image.
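A toy illustration of captions grounded in detected entities, in the spirit of the framework above: fill slotted caption templates with detector outputs. The detections and template here are invented for illustration:

```python
# Detector outputs and template are invented for illustration.
detections = [("dog", 0.97), ("frisbee", 0.91)]
template = "a <obj0> catching a <obj1>"

caption = template
for i, (label, _score) in enumerate(detections):
    caption = caption.replace(f"<obj{i}>", label)

print(caption)  # -> a dog catching a frisbee
```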
3 code implementations • ICCV 2021 • Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, Jianfeng Gao
This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
Ranked #45 on Instance Segmentation on COCO minival
1 code implementation • ICLR 2022 • Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning.
Ranked #16 on Self-Supervised Image Classification on ImageNet
1 code implementation • 27 Jul 2021 • Xiaotian Han, Jianwei Yang, Houdong Hu, Lei Zhang, Jianfeng Gao, Pengchuan Zhang
There is a surge of interest in image scene graph generation (object, attribute, and relationship detection) due to the need to build fine-grained image understanding models that go beyond object detection.
1 code implementation • 22 Nov 2021 • Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, JianFeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang
Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications.
Ranked #1 on Action Recognition In Videos on Kinetics-600
1 code implementation • CVPR 2022 • Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, Jianfeng Gao
Particularly, it attains gains up to 9.2% and 14.5% in average on zero-shot recognition benchmarks over the language-image contrastive learning and supervised learning methods, respectively.
2 code implementations • 20 Apr 2022 • Sheng Shen, Chunyuan Li, Xiaowei Hu, Jianwei Yang, Yujia Xie, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach, Jianfeng Gao
We propose K-LITE, a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in text with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts.
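The enrichment step is straightforward to approximate with NLTK's WordNet interface: append a gloss to each concept name before text encoding. This is a paraphrase of the idea, not K-LITE's pipeline:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def enrich(concept: str) -> str:
    """Append the first WordNet gloss to a concept name."""
    synsets = wn.synsets(concept.replace(" ", "_"))
    if not synsets:
        return concept                 # fall back to the bare name
    return f"{concept}, which is {synsets[0].definition()}"

# enrich("mahogany") returns the name plus its first WordNet definition,
# giving the text encoder something to match even for rare concepts.
```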
9 code implementations • NeurIPS 2016 • Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN); a sketch of this phrase-level step follows below.
Ranked #3 on Visual Question Answering (VQA) on VQA v1 test-std
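The phrase level of the hierarchy applies 1-D convolutions of widths 1-3 over word embeddings and max-pools across the n-gram responses. A sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

B, T, D = 2, 14, 512                         # batch, question length, embedding dim
embed = torch.randn(B, T, D)
x = embed.transpose(1, 2)                    # Conv1d expects (batch, channels, length)

convs = [nn.Conv1d(D, D, k, padding=k // 2) for k in (1, 2, 3)]
grams = [c(x)[..., :T] for c in convs]       # trim the even-kernel overhang
# Max over unigram/bigram/trigram responses at each position = phrase feature.
phrase = torch.stack(grams).max(0).values.transpose(1, 2)  # (B, T, D)
```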
3 code implementations • CVPR 2016 • Jianwei Yang, Devi Parikh, Dhruv Batra
In this paper, we propose a recurrent framework for Joint Unsupervised LEarning (JULE) of deep representations and image clusters.
Ranked #1 on Image Clustering on Coil-20
1 code implementation • 5 Dec 2023 • Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu, Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, Jianwei Yang
To address this issue, we have created grounded visual chat (GVC) data that combines grounding and chat capabilities.
1 code implementation • 21 Dec 2023 • Jitesh Jain, Jianwei Yang, Humphrey Shi
Secondly, we leverage the images from COCO and outputs from off-the-shelf vision perception models to create our COCO Segmentation Text (COST) dataset for training and evaluating MLLMs on the object perception task.
3 code implementations • 24 Aug 2014 • Jianwei Yang, Zhen Lei, Stan Z. Li
Moreover, networks trained on the combined data from the two datasets exhibit less bias across the datasets.
Ranked #2 on Face Anti-Spoofing on CASIA-MFSD
1 code implementation • 5 Mar 2017 • Jianwei Yang, Anitha Kannan, Dhruv Batra, Devi Parikh
We present LR-GAN: an adversarial image generation model which takes scene structure and context into account.
Ranked #4 on Image Generation on Stanford Cars
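The layered-and-recursive idea above reduces to alpha compositing: each step pastes a generated foreground onto the current canvas through its own mask. A minimal sketch with the generators omitted:

```python
import torch

def composite(background, foregrounds, masks):
    """background: (B, 3, H, W); each foreground is (B, 3, H, W)
    and each mask is (B, 1, H, W) with values in [0, 1]."""
    canvas = background
    for fg, m in zip(foregrounds, masks):
        canvas = m * fg + (1 - m) * canvas   # alpha-blend the new layer on top
    return canvas
```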
1 code implementation • CVPR 2023 • Haotian Liu, Kilho Son, Jianwei Yang, Ce Liu, Jianfeng Gao, Yong Jae Lee, Chunyuan Li
Image-text contrastive learning models such as CLIP have demonstrated strong task transfer ability.
Ranked #1 on Semi-Supervised Image Classification on ImageNet - 1% labeled data (using extra training data)
1 code implementation • NeurIPS 2017 • Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, Dhruv Batra
In contrast, discriminative dialog models (D) that are trained to rank a list of candidate human responses outperform their generative counterparts in terms of automatic metrics, diversity, and informativeness of the responses.
Ranked #8 on Visual Dialog on VisDial v0.9 val
2 code implementations • 13 Nov 2023 • An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, JianFeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang
We first benchmark MM-Navigator on our collected iOS screen dataset.
1 code implementation • ICCV 2021 • Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, Yin Li
To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.
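The pseudo-label construction can be sketched with spaCy: parse noun phrases from the caption and keep detections whose labels match a parsed concept. The synonym table below is a toy stand-in for the paper's concept matching:

```python
import spacy  # requires: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
caption = "a man riding a horse on the beach"
detections = [("person", (12, 40, 180, 300)), ("horse", (60, 120, 400, 380))]

concepts = {chunk.root.lemma_ for chunk in nlp(caption).noun_chunks}
synonyms = {"person": "man"}                 # toy stand-in for concept matching
pseudo = [(label, box) for label, box in detections
          if label in concepts or synonyms.get(label) in concepts]
# pseudo now carries region-level "pseudo" labels for scene graph training
```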
1 code implementation • 12 Dec 2023 • Xueyan Zou, Linjie Li, JianFeng Wang, Jianwei Yang, Mingyu Ding, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang
The proposed interface is adaptive to new tasks, and new models.
2 code implementations • 29 Mar 2022 • Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, Xin Eric Wang
In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classification task.
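One strategy such a study compares is the bottleneck adapter: a small residual MLP inserted into each frozen block so that only a few parameters train. A minimal sketch with illustrative dimensions:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual keeps the frozen path intact

# Usage: freeze every backbone parameter, then train only the inserted adapters.
```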
1 code implementation • NeurIPS 2019 • Jianwei Yang, Zhile Ren, Chuang Gan, Hongyuan Zhu, Devi Parikh
Convolutional neural networks process input data by sending channel-wise feature response maps to subsequent layers.
1 code implementation • 18 Nov 2020 • Hassan Akbari, Hamid Palangi, Jianwei Yang, Sudha Rao, Asli Celikyilmaz, Roland Fernandez, Paul Smolensky, Jianfeng Gao, Shih-Fu Chang
In this paper, we propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
1 code implementation • IEEE Transactions on Circuits and Systems I: Regular Papers 2020 • Guozhu Xin, Jun Han, Tianyu Yin, Yuchao Zhou, Jianwei Yang, Xu Cheng, Xiaoyang Zeng
In the 5G era, massive numbers of devices need to be securely connected to the edge of communication networks, while emerging quantum computers can easily break traditional public-key ciphers.
1 code implementation • 18 Oct 2023 • Cheng-Fu Yang, Yen-Chun Chen, Jianwei Yang, Xiyang Dai, Lu Yuan, Yu-Chiang Frank Wang, Kai-Wei Chang
Additional analysis shows that the contrastive objective and meta-actions are complementary in achieving the best results, and the resulting agent better aligns its states with corresponding instructions, making it more suitable for real-world embodied agents.
no code implementations • 1 Oct 2018 • Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh
Our question generation policy generalizes to new environments and a new pair of eyes, i.e., a new visual system.
no code implementations • 9 Apr 2019 • Jianwei Yang, Zhile Ren, Mingze Xu, Xinlei Chen, David Crandall, Devi Parikh, Dhruv Batra
Passive visual systems typically fail to recognize objects in the amodal setting where they are heavily occluded.
no code implementations • ICCV 2019 • Jianwei Yang, Zhile Ren, Mingze Xu, Xinlei Chen, David J. Crandall, Devi Parikh, Dhruv Batra
Passive visual systems typically fail to recognize objects in the amodal setting where they are heavily occluded.
no code implementations • 22 May 2020 • Yuhang Song, Wenbo Li, Lei Zhang, Jianwei Yang, Emre Kiciman, Hamid Palangi, Jianfeng Gao, C. -C. Jay Kuo, Pengchuan Zhang
We study in this paper the problem of novel human-object interaction (HOI) detection, aiming at improving the generalization ability of the model to unseen scenarios.
no code implementations • 1 Jan 2021 • Jianwei Yang, Yonatan Bisk, Jianfeng Gao
Building video and language understanding models requires grounding linguistic concepts and video contents into a shared space.
no code implementations • 21 Dec 2020 • Jianwei Yang, Jiayuan Mao, Jiajun Wu, Devi Parikh, David D. Cox, Joshua B. Tenenbaum, Chuang Gan
In contrast, symbolic and modular models have a relatively better grounding and robustness, though at the cost of accuracy.
no code implementations • ICCV 2021 • Jianwei Yang, Yonatan Bisk, Jianfeng Gao
This is motivated by the observation that for a video-text pair, the content words in the text, such as nouns and verbs, are more likely to be aligned with the visual contents in the video than the function words.
Ranked #3 on Temporal Action Localization on CrossTask (using extra training data)
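Selecting content words is a standard POS-tagging step; the sketch below keeps nouns and verbs with NLTK as alignment anchors. This illustrates the motivation only, not the paper's alignment model:

```python
import nltk  # requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("a person slices a tomato on the cutting board")
tagged = nltk.pos_tag(tokens)
content = [w for w, tag in tagged if tag.startswith(("NN", "VB"))]
# Nouns and verbs ("person", "slices", "tomato", ...) serve as alignment
# anchors; determiners and prepositions are ignored.
```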
no code implementations • ICCV 2021 • Xiyang Dai, Yinpeng Chen, Jianwei Yang, Pengchuan Zhang, Lu Yuan, Lei Zhang
To mitigate the second limitation of learning difficulty, we introduce a dynamic decoder by replacing the cross-attention module with a ROI-based dynamic attention in the Transformer decoder.
no code implementations • 15 Jan 2022 • Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Jianwei Yang, Xiyang Dai, Bin Xiao, Haoxuan You, Shih-Fu Chang, Lu Yuan
Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only.
no code implementations • 22 Apr 2022 • Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai, Bin Xiao, Jianwei Yang, Haoxuan You, Kai-Wei Chang, Shih-Fu Chang, Lu Yuan
Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.
Ranked #4 on Visual Question Answering (VQA) on VCR (Q-A) test
no code implementations • 16 Oct 2023 • Yu Gu, Jianwei Yang, Naoto Usuyama, Chunyuan Li, Sheng Zhang, Matthew P. Lungren, Jianfeng Gao, Hoifung Poon
In a comprehensive battery of tests on counterfactual medical image generation, BiomedJourney substantially outperforms prior state-of-the-art methods in instruction image editing and medical image generation such as InstructPix2Pix and RoentGen.
no code implementations • NeurIPS 2023 • Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao
In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
no code implementations • 4 Dec 2023 • Jiarui Xu, Yossi Gandelsman, Amir Bar, Jianwei Yang, Jianfeng Gao, Trevor Darrell, Xiaolong Wang
Given a textual description of a visual task (e.g., "Left: input image, Right: foreground segmentation"), a few input-output visual examples, or both, the model in-context learns to solve it for a new test input.
no code implementations • 15 Jan 2024 • Ho Hin Lee, Yu Gu, Theodore Zhao, Yanbo Xu, Jianwei Yang, Naoto Usuyama, Cliff Wong, Mu Wei, Bennett A. Landman, Yuankai Huo, Alberto Santamaria-Pang, Hoifung Poon
This transformative technology, originally developed for general-purpose computer vision, has found rapid application in medical image processing.
no code implementations • 7 Mar 2024 • Hitesh Kandala, Jianfeng Gao, Jianwei Yang
We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation.
no code implementations • 12 Mar 2024 • Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, ZiYi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Yu Gu, Cliff Wong, Mu Wei, Tristan Naumann, Muhao Chen, Matthew P. Lungren, Serena Yeung-Levy, Curtis P. Langlotz, Sheng Wang, Hoifung Poon
Frontier models such as GPT-4V still have major competency gaps in multimodal capabilities for biomedical applications.