Search Results for author: Liunian Harold Li

Found 12 papers, 5 papers with code

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

no code implementations • 19 Apr 2022 • Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Yong Jae Lee, Houdong Hu, Zicheng Liu, Jianfeng Gao

A variety of evaluation metrics are used, including sample-efficiency (zero-shot and few-shot) and parameter-efficiency (linear probing and full model fine-tuning).

Fairness · Image Classification · +1
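As a concrete illustration of the parameter-efficiency axis ELEVATER measures, here is a minimal PyTorch sketch contrasting linear probing (frozen backbone, trainable head) with full model fine-tuning; the encoder, data, and hyperparameters are placeholders, not the ELEVATER toolkit's API.

```python
# Sketch: linear probing vs. full fine-tuning (placeholder encoder and data,
# not the ELEVATER toolkit API).
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))  # stand-in backbone
head = nn.Linear(512, 10)  # task-specific classifier

def make_optimizer(mode: str):
    if mode == "linear_probe":
        for p in encoder.parameters():
            p.requires_grad = False          # freeze the backbone
        params = list(head.parameters())     # train only the linear head
    else:  # "full_finetune"
        params = list(encoder.parameters()) + list(head.parameters())
    return torch.optim.AdamW(params, lr=1e-4)

opt = make_optimizer("linear_probe")
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(head(encoder(x)), y)
loss.backward()   # gradients flow only into the head under linear probing
opt.step()
```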

SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

no code implementations • 16 Dec 2021 • Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, Shih-Fu Chang

For pre-training, a scene-graph-aware method is proposed to leverage the structural knowledge extracted from the visual scene graph.

Visual Commonsense Reasoning

RegionCLIP: Region-based Language-Image Pretraining

no code implementations • 16 Dec 2021 • Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao

However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans.

Image Classification · Object Detection · +1
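To make the domain-shift argument concrete, the sketch below scores cropped region proposals with a whole-image CLIP model, i.e., the naive baseline the abstract says performs poorly. It uses the OpenAI `clip` package's public `encode_image`/`encode_text` calls; the image path and boxes are hypothetical.

```python
# Sketch: naively scoring cropped regions with whole-image CLIP, the baseline
# RegionCLIP argues suffers from domain shift. Assumes the OpenAI `clip`
# package is installed; "scene.jpg" and the boxes are hypothetical inputs.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text = clip.tokenize(labels).to(device)

image = Image.open("scene.jpg")
boxes = [(0, 0, 100, 100), (50, 50, 200, 200)]  # hypothetical region proposals

with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    for box in boxes:
        crop = preprocess(image.crop(box)).unsqueeze(0).to(device)
        img_feat = model.encode_image(crop)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)
        print(box, labels[probs.argmax().item()])
```

CLIP was trained on full images, so the statistics of these tight crops fall outside its training distribution; that mismatch is the domain shift the paper addresses with region-level pretraining.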

Grounded Language-Image Pre-training

1 code implementation • 7 Dec 2021 • Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao

The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantically rich.

 Ranked #1 on Phrase Grounding on Flickr30k Entities Test (using extra training data)

Object Detection · Phrase Grounding
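A schematic of the unification idea, following GLIP's published formulation: detection categories are concatenated into a caption, and the fixed C-way classifier head is replaced by region-word alignment scores. All tensors below are random stand-ins, not the released GLIP code.

```python
# Sketch of GLIP-style detection-as-grounding: detection classes become a
# caption, and classification logits are region-word dot products.
# Encoders and shapes are placeholders, not GLIP's implementation.
import torch

classes = ["person", "bicycle", "car"]
caption = ". ".join(classes)  # detection labels recast as grounding text

num_regions, num_words, dim = 5, len(classes), 256
region_feats = torch.randn(num_regions, dim)  # from a visual backbone + region heads
word_feats = torch.randn(num_words, dim)      # from a language encoder over the caption

# Alignment scores between every region and every class phrase replace the
# usual fixed C-way classifier, so grounding data trains the same head.
alignment = region_feats @ word_feats.T       # (num_regions, num_words)
pred = alignment.argmax(dim=-1)               # best-matching phrase per region
print([classes[i] for i in pred.tolist()])
```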

How Much Can CLIP Benefit Vision-and-Language Tasks?

2 code implementations • 13 Jul 2021 • Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer

Most existing Vision-and-Language (V&L) models perceive the visual world through pre-trained visual encoders trained on a relatively small set of manually annotated data (compared to web-crawled data).

Ranked #5 on Visual Entailment on SNLI-VE val (using extra training data)

Question Answering · Visual Entailment · +1
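A rough sketch of the paper's central swap: using CLIP's visual encoder in place of a region-detector backbone to feed a downstream V&L head. The `clip` calls are the package's real API; the question features, fusion module, and the 3,129-way answer head (a commonly used VQA answer-vocabulary size) are illustrative placeholders.

```python
# Sketch: plugging CLIP's visual encoder into a V&L pipeline in place of a
# region-detector backbone. The fusion head is a toy placeholder, and
# "vqa_example.jpg" is a hypothetical input.
import clip
import torch
import torch.nn as nn
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("vqa_example.jpg")).unsqueeze(0)
with torch.no_grad():
    visual_feat = model.encode_image(image).float()  # (1, 512) CLIP visual features

text_feat = torch.randn(1, 512)  # stand-in for an encoded question
fusion = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 3129))
answer_logits = fusion(torch.cat([visual_feat, text_feat], dim=-1))  # VQA-style head
```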

What Does BERT with Vision Look At?

no code implementations • ACL 2020 • Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang

Pre-trained visually grounded language models such as ViLBERT, LXMERT, and UNITER have achieved significant performance improvements on vision-and-language tasks, but what they learn during pre-training remains unclear.

Language Modelling

Efficient Contextual Representation Learning Without Softmax Layer

no code implementations • 28 Feb 2019 • Liunian Harold Li, Patrick H. Chen, Cho-Jui Hsieh, Kai-Wei Chang

Our framework reduces the time spent on the output layer to a negligible level, eliminates almost all the trainable parameters of the softmax layer, and performs language modeling without truncating the vocabulary.

Dimensionality Reduction · Language Modelling · +1
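The output-layer idea can be sketched as regressing contextual states onto fixed pre-trained word embeddings, so no vocabulary-sized softmax is ever computed; this is a schematic reconstruction under that assumption, not the authors' released code.

```python
# Sketch: replacing the V-way softmax output layer with regression onto
# fixed pre-trained word embeddings (schematic reconstruction).
import torch
import torch.nn as nn

vocab_size, dim, hidden = 50_000, 300, 512
pretrained_emb = torch.randn(vocab_size, dim)   # frozen word embeddings
pretrained_emb /= pretrained_emb.norm(dim=-1, keepdim=True)

proj = nn.Linear(hidden, dim)                   # the only trained output parameters

hidden_states = torch.randn(8, hidden)          # contextual states from the encoder
targets = torch.randint(0, vocab_size, (8,))

pred = proj(hidden_states)
pred = pred / pred.norm(dim=-1, keepdim=True)
# Cosine-distance loss to the gold token's embedding: no O(V) matmul,
# no trainable V x d softmax matrix, no vocabulary truncation needed.
loss = (1 - (pred * pretrained_emb[targets]).sum(dim=-1)).mean()
loss.backward()
```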
