Search Results for author: Jianwei Yang

Found 39 papers, 28 papers with code

A Simple Framework for Open-Vocabulary Segmentation and Detection

2 code implementations 14 Mar 2023 Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, Lei Zhang

We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets.

Ranked #1 on Instance Segmentation on ADE20K val (using extra training data)

Instance Segmentation Panoptic Segmentation

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

1 code implementation 9 Mar 2023 Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang

To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion.

object-detection Referring Expression +2

Generalized Decoding for Pixel, Image, and Language

1 code implementation 21 Dec 2022 Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, JianFeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, Jianfeng Gao

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly.

Ranked #3 on Instance Segmentation on ADE20K val (using extra training data)

Image Segmentation Panoptic Segmentation +1

Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks

no code implementations 22 Apr 2022 Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai, Bin Xiao, Jianwei Yang, Haoxuan You, Kai-Wei Chang, Shih-Fu Chang, Lu Yuan

Experiments demonstrate that MAD leads to consistent gains in the low-shot, domain-shifted, and fully-supervised conditions on VCR, SNLI-VE, and VQA, achieving SOTA performance on VCR compared to other single models pretrained with image-text data.

Question Answering Visual Commonsense Reasoning +2

K-LITE: Learning Transferable Visual Models with External Knowledge

1 code implementation 20 Apr 2022 Sheng Shen, Chunyuan Li, Xiaowei Hu, Jianwei Yang, Yujia Xie, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach, Jianfeng Gao

We propose K-LITE, a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in text with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts.

Benchmarking Image Classification +3

Unified Contrastive Learning in Image-Text-Label Space

1 code implementation CVPR 2022 Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, Jianfeng Gao

Particularly, it attains gains up to 9.2% and 14.5% on average on zero-shot recognition benchmarks over the language-image contrastive learning and supervised learning methods, respectively.

Contrastive Learning Image Classification +2

Parameter-efficient Model Adaptation for Vision Transformers

2 code implementations 29 Mar 2022 Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, Xin Eric Wang

In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classification task.

Benchmarking Classification +2

Focal Modulation Networks

5 code implementations 22 Mar 2022 Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao

For semantic segmentation with UPerNet, FocalNet base at single-scale outperforms Swin by 2.4, and beats Swin at multi-scale (50.5 vs.

Ranked #5 on Object Detection on COCO minival (using extra training data)

Image Classification Object Detection +1

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

no code implementations 15 Jan 2022 Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Jianwei Yang, Xiyang Dai, Bin Xiao, Haoxuan You, Shih-Fu Chang, Lu Yuan

Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only.

Question Answering Visual Commonsense Reasoning +2

RegionCLIP: Region-based Language-Image Pretraining

1 code implementation CVPR 2022 Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao

However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans.

Ranked #4 on Open Vocabulary Object Detection on MSCOCO (using extra training data)

Image Classification object-detection +2

Grounded Language-Image Pre-training

1 code implementation CVPR 2022 Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao

The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich.

2D object detection object-detection +2

Focal Attention for Long-Range Interactions in Vision Transformers

1 code implementation NeurIPS 2021 Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

With focal attention, we propose a new variant of Vision Transformer models, called Focal Transformers, which achieve superior performance over the state-of-the-art (SoTA) Vision Transformers on a range of public image classification and object detection benchmarks.

Image Classification object-detection +2

Florence: A New Foundation Model for Computer Vision

1 code implementation 22 Nov 2021 Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, JianFeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang

Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications.

Action Classification Action Recognition In Videos +11

Learning to Generate Scene Graph from Natural Language Supervision

1 code implementation ICCV 2021 Yiwu Zhong, Jing Shi, Jianwei Yang, Chenliang Xu, Yin Li

To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graphs.

Graph Generation Scene Graph Generation

TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment

no code implementations ICCV 2021 Jianwei Yang, Yonatan Bisk, Jianfeng Gao

This is motivated by the observation that for a video-text pair, the content words in the text, such as nouns and verbs, are more likely to be aligned with the visual contents in the video than the function words.

Action Segmentation Contrastive Learning +4

Image Scene Graph Generation (SGG) Benchmark

1 code implementation 27 Jul 2021 Xiaotian Han, Jianwei Yang, Houdong Hu, Lei Zhang, Jianfeng Gao, Pengchuan Zhang

There is a surge of interest in image scene graph generation (object, attribute and relationship detection) due to the need to build fine-grained image understanding models that go beyond object detection.

Graph Generation object-detection +3

Focal Self-attention for Local-Global Interactions in Vision Transformers

3 code implementations 1 Jul 2021 Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers on a range of public image classification and object detection benchmarks.

Image Classification Instance Segmentation +3

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

3 code implementations ICCV 2021 Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, Jianfeng Gao

This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.

Image Classification Instance Segmentation +2

VinVL: Revisiting Visual Representations in Vision-Language Models

7 code implementations CVPR 2021 Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao

In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR (Li et al., 2020), and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.

Image Captioning object-detection +1

Dynamic DETR: End-to-End Object Detection With Dynamic Attention

no code implementations ICCV 2021 Xiyang Dai, Yinpeng Chen, Jianwei Yang, Pengchuan Zhang, Lu Yuan, Lei Zhang

To mitigate the second limitation of learning difficulty, we introduce a dynamic decoder by replacing the cross-attention module with a ROI-based dynamic attention in the Transformer decoder.

object-detection Object Detection

Token-Level Contrast for Video and Language Alignment

no code implementations 1 Jan 2021 Jianwei Yang, Yonatan Bisk, Jianfeng Gao

Building video and language understanding models requires grounding linguistic concepts and video contents into a shared space.

Object-Centric Diagnosis of Visual Reasoning

no code implementations 21 Dec 2020 Jianwei Yang, Jiayuan Mao, Jiajun Wu, Devi Parikh, David D. Cox, Joshua B. Tenenbaum, Chuang Gan

In contrast, symbolic and modular models have a relatively better grounding and robustness, though at the cost of accuracy.

Question Answering Visual Question Answering (VQA) +1

Novel Human-Object Interaction Detection via Adversarial Domain Generalization

no code implementations 22 May 2020 Yuhang Song, Wenbo Li, Lei Zhang, Jianwei Yang, Emre Kiciman, Hamid Palangi, Jianfeng Gao, C.-C. Jay Kuo, Pengchuan Zhang

We study in this paper the problem of novel human-object interaction (HOI) detection, aiming at improving the generalization ability of the model to unseen scenarios.

Domain Generalization Human-Object Interaction Detection

Cross-channel Communication Networks

1 code implementation NeurIPS 2019 Jianwei Yang, Zhile Ren, Chuang Gan, Hongyuan Zhu, Devi Parikh

Convolutional neural networks process input data by sending channel-wise feature response maps to subsequent layers.

Embodied Visual Recognition

no code implementations 9 Apr 2019 Jianwei Yang, Zhile Ren, Mingze Xu, Xinlei Chen, David Crandall, Devi Parikh, Dhruv Batra

Passive visual systems typically fail to recognize objects in the amodal setting where they are heavily occluded.

Object Localization Semantic Segmentation

Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition

no code implementations 1 Oct 2018 Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

Our question generation policy generalizes to new environments and a new pair of eyes, i.e., a new visual system.

Question Generation Question-Generation

Graph R-CNN for Scene Graph Generation

3 code implementations ECCV 2018 Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, Devi Parikh

We propose a novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images.

Graph Generation Scene Graph Generation

Neural Baby Talk

1 code implementation CVPR 2018 Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image.

Image Captioning slot-filling +1

Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model

1 code implementation NeurIPS 2017 Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, Dhruv Batra

In contrast, discriminative dialog models (D) that are trained to rank a list of candidate human responses outperform their generative counterparts in terms of automatic metrics, diversity, and informativeness of the responses.

Informativeness Metric Learning +2

LR-GAN: Layered Recursive Generative Adversarial Networks for Image Generation

1 code implementation 5 Mar 2017 Jianwei Yang, Anitha Kannan, Dhruv Batra, Devi Parikh

We present LR-GAN: an adversarial image generation model which takes scene structure and context into account.

Image Generation

Hierarchical Question-Image Co-Attention for Visual Question Answering

9 code implementations NeurIPS 2016 Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolutional neural network (CNN).

Visual Dialog Visual Question Answering (VQA)

Joint Unsupervised Learning of Deep Representations and Image Clusters

2 code implementations CVPR 2016 Jianwei Yang, Devi Parikh, Dhruv Batra

In this paper, we propose a recurrent framework for Joint Unsupervised LEarning (JULE) of deep representations and image clusters.

Image Clustering Representation Learning

Learn Convolutional Neural Network for Face Anti-Spoofing

2 code implementations 24 Aug 2014 Jianwei Yang, Zhen Lei, Stan Z. Li

Moreover, the nets trained using combined data from two datasets have less bias between the two datasets.

Face Anti-Spoofing
