Search Results for author: Pengchuan Zhang

Found 56 papers, 34 papers with code

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

no code implementations CVPR 2024 Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martín-Martín, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, Jiajun Wu

We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI benchmark, BEHAVIOR-1K.

Scene Understanding

Evaluating Text-to-Visual Generation with Image-to-Text Generation

2 code implementations 1 Apr 2024 Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, Deva Ramanan

For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations.

Question Answering Text Generation +2

The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task

no code implementations 15 Nov 2023 Yifan Wu, Pengchuan Zhang, Wenhan Xiong, Barlas Oguz, James C. Gee, Yixin Nie

The study explores the effectiveness of the Chain-of-Thought approach, known for its proficiency in language tasks by breaking them down into sub-tasks and intermediate steps, in improving vision-language tasks that demand sophisticated perception and reasoning.

Visual Reasoning

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

1 code implementation 14 Oct 2023 Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny

Motivated by this, we target to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others.

Language Modelling Large Language Model +4

Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding

no code implementations 20 Sep 2023 Mohamed Afham, Satya Narayan Shukla, Omid Poursaeed, Pengchuan Zhang, Ashish Shah, Ser-Nam Lim

While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length.

Temporal Action Localization Video Classification +1

UniVTG: Towards Unified Video-Language Temporal Grounding

1 code implementation ICCV 2023 Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their abilities to generalize to various VTG tasks and labels.

Highlight Detection Moment Retrieval +3

Revisiting the Role of Language Priors in Vision-Language Models

1 code implementation 2 Jun 2023 Zhiqiu Lin, Xinyue Chen, Deepak Pathak, Pengchuan Zhang, Deva Ramanan

Our first observation is that they can be repurposed for discriminative tasks (such as image-text retrieval) by simply computing the match score of generating a particular text string given an image.

Image-text matching Language Modelling +6
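
The match score described in this entry can be illustrated with a minimal sketch, assuming the score is the log-likelihood of a caption given an image, summed over tokens. The function name `match_score` and the per-token probabilities below are hypothetical placeholders, not values or code from the paper; a real system would obtain the conditional probabilities from a generative vision-language model.

```python
import math
from typing import Dict, List


def match_score(token_probs: List[float]) -> float:
    """Sum of log P(token_t | image, earlier tokens) over one caption."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical per-token conditional probabilities for two candidate
# captions of the same image (made-up numbers for illustration only).
candidates: Dict[str, List[float]] = {
    "a dog on grass": [0.9, 0.8, 0.7, 0.9],
    "grass on a dog": [0.3, 0.2, 0.6, 0.5],
}

# Retrieval picks the caption with the highest generative score.
best = max(candidates, key=lambda c: match_score(candidates[c]))
print(best)
```

Working in log space keeps the product of many small probabilities numerically stable; whether the paper normalizes the score in any further way is not stated in this snippet.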

Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality

no code implementations 23 May 2023 Harman Singh, Pengchuan Zhang, Qifan Wang, Mengjiao Wang, Wenhan Xiong, Jingfei Du, Yu Chen

Along with this, we propose novel negative mining techniques in the scene graph space for improving attribute binding and relation understanding.

Ranked #1 on Image Retrieval on CREPE (Compositional REPresentation Evaluation) (Recall@1 (HN-Comp, UC) metric)

Attribute Contrastive Learning +4

DIME-FM: DIstilling Multimodal and Efficient Foundation Models

no code implementations 31 Mar 2023 Ximeng Sun, Pengchuan Zhang, Peizhao Zhang, Hardik Shah, Kate Saenko, Xide Xia

We transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32 model, with only 40M public images and 28.4M unpaired public sentences.

Image Classification

DIME-FM: DIstilling Multimodal and Efficient Foundation Models

no code implementations ICCV 2023 Ximeng Sun, Pengchuan Zhang, Peizhao Zhang, Hardik Shah, Kate Saenko, Xide Xia

In this paper, we introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models using a relatively small amount of inexpensive, unpaired images and sentences.

Image Classification

Unifying Tracking and Image-Video Object Detection

no code implementations 20 Nov 2022 Peirong Liu, Rui Wang, Pengchuan Zhang, Omid Poursaeed, Yipin Zhou, Xuefei Cao, Sreya Dutta Roy, Ashish Shah, Ser-Nam Lim

We propose TrIVD (Tracking and Image-Video Detection), the first framework that unifies image OD, video OD, and MOT within one end-to-end model.

Multi-Object Tracking Object +2

GLIPv2: Unifying Localization and Vision-Language Understanding

1 code implementation 12 Jun 2022 Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning).

Ranked #1 on Phrase Grounding on Flickr30k Entities Test (using extra training data)

Contrastive Learning Image Captioning +7

K-LITE: Learning Transferable Visual Models with External Knowledge

2 code implementations 20 Apr 2022 Sheng Shen, Chunyuan Li, Xiaowei Hu, Jianwei Yang, Yujia Xie, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach, Jianfeng Gao

We propose K-LITE, a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in text with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts.

Benchmarking Descriptive +4

Missingness Bias in Model Debugging

1 code implementation ICLR 2022 Saachi Jain, Hadi Salman, Eric Wong, Pengchuan Zhang, Vibhav Vineet, Sai Vemprala, Aleksander Madry

Missingness, or the absence of features from an input, is a concept fundamental to many model debugging tools.

Unified Contrastive Learning in Image-Text-Label Space

1 code implementation CVPR 2022 Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan, Jianfeng Gao

Particularly, it attains gains up to 9.2% and 14.5% on average on zero-shot recognition benchmarks over the language-image contrastive learning and supervised learning methods, respectively.

Contrastive Learning Image Classification +2

Parameter-efficient Model Adaptation for Vision Transformers

2 code implementations 29 Mar 2022 Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, Xin Eric Wang

In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classification task.

Benchmarking Classification +2

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

no code implementations 3 Mar 2022 Feng Li, Hao Zhang, Yi-Fan Zhang, Shilong Liu, Jian Guo, Lionel M. Ni, Pengchuan Zhang, Lei Zhang

This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends shifting from single modality processing to multiple modality comprehension.

Few-Shot Learning Representation Learning

RegionCLIP: Region-based Language-Image Pretraining

1 code implementation CVPR 2022 Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao

However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans.

Ranked #12 on Open Vocabulary Object Detection on MSCOCO (using extra training data)

Image Classification Object +3

Grounded Language-Image Pre-training

2 code implementations CVPR 2022 Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao

The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich.

Described Object Detection Few-Shot Object Detection +1

Focal Attention for Long-Range Interactions in Vision Transformers

1 code implementation NeurIPS 2021 Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

With focal attention, we propose a new variant of Vision Transformer models, called Focal Transformers, which achieve superior performance over the state-of-the-art (SoTA) Vision Transformers on a range of public image classification and object detection benchmarks.

Image Classification object-detection +2

Florence: A New Foundation Model for Computer Vision

1 code implementation 22 Nov 2021 Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, JianFeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang

Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications.

Action Classification Action Recognition In Videos +12

Image Scene Graph Generation (SGG) Benchmark

1 code implementation 27 Jul 2021 Xiaotian Han, Jianwei Yang, Houdong Hu, Lei Zhang, Jianfeng Gao, Pengchuan Zhang

There is a surge of interest in image scene graph generation (object, attribute and relationship detection) due to the need of building fine-grained image understanding models that go beyond object detection.

Attribute Graph Generation +6

Focal Self-attention for Local-Global Interactions in Vision Transformers

3 code implementations 1 Jul 2021 Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers on a range of public image classification and object detection benchmarks.

Image Classification Instance Segmentation +3

3DB: A Framework for Debugging Computer Vision Models

1 code implementation 7 Jun 2021 Guillaume Leclerc, Hadi Salman, Andrew Ilyas, Sai Vemprala, Logan Engstrom, Vibhav Vineet, Kai Xiao, Pengchuan Zhang, Shibani Santurkar, Greg Yang, Ashish Kapoor, Aleksander Madry

We introduce 3DB: an extendable, unified framework for testing and debugging vision models using photorealistic simulation.

Multiscale Invertible Generative Networks for High-Dimensional Bayesian Inference

no code implementations 12 May 2021 Shumao Zhang, Pengchuan Zhang, Thomas Y. Hou

We propose a Multiscale Invertible Generative Network (MsIGN) and associated training algorithm that leverages multiscale structure to solve high-dimensional Bayesian inference.

Bayesian Inference Image Generation +1

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

3 code implementations ICCV 2021 Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, Jianfeng Gao

This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.

Image Classification Instance Segmentation +2

Out-of-distribution Prediction with Invariant Risk Minimization: The Limitation and An Effective Fix

no code implementations 16 Jan 2021 Ruocheng Guo, Pengchuan Zhang, Hao Liu, Emre Kiciman

Nevertheless, we find that the performance of IRM can be dramatically degraded under strong Λ spuriousness -- when the spurious correlation between the spurious features and the class label is strong due to the strong causal influence of their common cause, the domain label, on both of them (see Fig.

VinVL: Revisiting Visual Representations in Vision-Language Models

7 code implementations CVPR 2021 Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao

In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, Oscar (Li et al., 2020), and utilize an improved approach, Oscar+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.

Image Captioning Image-text matching +4

Dynamic DETR: End-to-End Object Detection With Dynamic Attention

no code implementations ICCV 2021 Xiyang Dai, Yinpeng Chen, Jianwei Yang, Pengchuan Zhang, Lu Yuan, Lei Zhang

To mitigate the second limitation of learning difficulty, we introduce a dynamic decoder by replacing the cross-attention module with a ROI-based dynamic attention in the Transformer decoder.

Decoder object-detection +1

MiniVLM: A Smaller and Faster Vision-Language Model

no code implementations 13 Dec 2020 JianFeng Wang, Xiaowei Hu, Pengchuan Zhang, Xiujun Li, Lijuan Wang, Lei Zhang, Jianfeng Gao, Zicheng Liu

We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, to significantly reduce the time cost of visual feature extraction by 95%, compared to a baseline model.

Language Modelling

MagGAN: High-Resolution Face Attribute Editing with Mask-Guided Generative Adversarial Network

no code implementations 3 Oct 2020 Yi Wei, Zhe Gan, Wenbo Li, Siwei Lyu, Ming-Ching Chang, Lei Zhang, Jianfeng Gao, Pengchuan Zhang

We present Mask-guided Generative Adversarial Network (MagGAN) for high-resolution face attribute editing, in which semantic facial masks from a pre-trained face parser are used to guide the fine-grained image editing process.

Attribute Generative Adversarial Network +1

Training Sparse Neural Networks using Compressed Sensing

1 code implementation 21 Aug 2020 Jonathan W. Siegel, Jianhong Chen, Pengchuan Zhang, Jinchao Xu

The adaptive weighting we introduce corresponds to a novel regularizer based on the logarithm of the absolute value of the weights.
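
A regularizer of the kind described in this entry can be sketched as follows. This is an illustrative assumption, not the paper's exact formulation: the penalty is taken to be the sum of log(|w| + eps) over the weights, where the smoothing constant `eps` and the function name `log_abs_penalty` are hypothetical and are added here only to keep the logarithm finite at zero.

```python
import math
from typing import Iterable


def log_abs_penalty(weights: Iterable[float], eps: float = 1e-3) -> float:
    """Sum of log(|w| + eps); eps is an assumed smoothing constant."""
    return sum(math.log(abs(w) + eps) for w in weights)


# Two weight vectors with the same L1 norm (1.0): the sparse one gets a
# lower penalty, which is why log-type penalties favor sparsity.
sparse = [1.0, 0.0]
dense = [0.5, 0.5]
print(log_abs_penalty(sparse) < log_abs_penalty(dense))
```

An L1 penalty would score both vectors identically, so the comparison shows what the logarithm adds: weights near zero are pushed much harder toward exactly zero than large weights are shrunk.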

Novel Human-Object Interaction Detection via Adversarial Domain Generalization

no code implementations 22 May 2020 Yuhang Song, Wenbo Li, Lei Zhang, Jianwei Yang, Emre Kiciman, Hamid Palangi, Jianfeng Gao, C.-C. Jay Kuo, Pengchuan Zhang

We study in this paper the problem of novel human-object interaction (HOI) detection, aiming at improving the generalization ability of the model to unseen scenarios.

Domain Generalization Human-Object Interaction Detection +1

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

4 code implementations ECCV 2020 Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiao-Wei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao

Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks.

Ranked #1 on Image Retrieval on MS COCO (Recall@10 metric)

Image Captioning Image Retrieval +3

Object-Centric Image Generation from Layouts

no code implementations 16 Mar 2020 Tristan Sylvain, Pengchuan Zhang, Yoshua Bengio, R. Devon Hjelm, Shikhar Sharma

In this paper, we start with the idea that a model must be able to understand individual objects and relationships between objects in order to generate complex scenes well.

Generative Adversarial Network Layout-to-Image Generation +1

Statistical Adaptive Stochastic Gradient Methods

1 code implementation 25 Feb 2020 Pengchuan Zhang, Hunter Lang, Qiang Liu, Lin Xiao

We propose a statistical adaptive procedure called SALSA for automatically scheduling the learning rate (step size) in stochastic gradient methods.


Statistical Adaptive Stochastic Optimization

no code implementations 25 Sep 2019 Pengchuan Zhang, Hunter Lang, Qiang Liu, Lin Xiao

We investigate statistical methods for automatically scheduling the learning rate (step size) in stochastic optimization.

Scheduling Stochastic Optimization

Using Statistics to Automate Stochastic Optimization

no code implementations NeurIPS 2019 Hunter Lang, Pengchuan Zhang, Lin Xiao

Despite the development of numerous adaptive optimizers, tuning the learning rate of stochastic gradient methods remains a major roadblock to obtaining good practical performance in machine learning.

Stochastic Optimization

Object-driven Text-to-Image Synthesis via Adversarial Training

1 code implementation CVPR 2019 Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, Jianfeng Gao

In this paper, we propose Object-driven Attentive Generative Adversarial Networks (Obj-GANs) that allow object-centered text-to-image synthesis for complex scenes.

Image Generation Object

A Convex Relaxation Barrier to Tight Robustness Verification of Neural Networks

3 code implementations NeurIPS 2019 Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, Pengchuan Zhang

This framework works for neural networks with diverse architectures and nonlinearities and covers both primal and dual views of robustness verification.

RecurJac: An Efficient Recursive Algorithm for Bounding Jacobian Matrix of Neural Networks and Its Applications

4 code implementations 28 Oct 2018 Huan Zhang, Pengchuan Zhang, Cho-Jui Hsieh

The Jacobian matrix (or the gradient for single-output networks) is directly related to many important properties of neural networks, such as the function landscape, stationary points, (local) Lipschitz constants and robustness to adversarial attacks.

Turbo Learning for Captionbot and Drawingbot

no code implementations NeurIPS 2018 Qiuyuan Huang, Pengchuan Zhang, Dapeng Wu, Lei Zhang

We study in this paper the problems of both image captioning and text-to-image generation, and present a novel turbo learning approach to jointly training an image-to-text generator (a.k.a.

Image Captioning Text Generation +1

AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

20 code implementations CVPR 2018 Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, Xiaodong He

In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation.

Generative Adversarial Network Image-text matching +2

On the Discrimination-Generalization Tradeoff in GANs

no code implementations ICLR 2018 Pengchuan Zhang, Qiang Liu, Dengyong Zhou, Tao Xu, Xiaodong He

When evaluated with neural distance, our bounds show that generalization is guaranteed as long as the discriminator set is small enough, regardless of the size of the generator or hypothesis set.

Generalization Bounds

A sparse decomposition of low rank symmetric positive semi-definite matrices

1 code implementation 3 Jul 2016 Thomas Y. Hou, Qin Li, Pengchuan Zhang

In this paper, we partition the indices from 1 to $N$ into several patches and propose to quantify the sparseness of a vector by the number of patches on which it is nonzero, which is called patch-wise sparseness.

Numerical Analysis
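
The patch-wise sparseness defined in this entry can be illustrated with a short sketch. The function name `patchwise_sparseness`, the tolerance parameter `tol`, and the example partition are assumptions made for illustration; indices here are 0-based, whereas the abstract numbers them from 1 to N.

```python
from typing import Sequence


def patchwise_sparseness(
    v: Sequence[float],
    patches: Sequence[Sequence[int]],
    tol: float = 0.0,
) -> int:
    """Count the patches on which v has at least one entry with |v[i]| > tol."""
    return sum(1 for patch in patches if any(abs(v[i]) > tol for i in patch))


# Six indices partitioned into three patches of two indices each.
patches = [[0, 1], [2, 3], [4, 5]]
v = [0.0, 3.0, 0.0, 0.0, -1.0, 2.0]

print(patchwise_sparseness(v, patches))  # 2
```

The vector has three nonzero entries but is supported on only two patches, so its patch-wise sparseness is 2 while its ordinary entry-wise count of nonzeros is 3.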
