Search Results for author: Linjie Li

Found 20 papers, 12 papers with code

Cross-modal Representation Learning for Zero-shot Action Recognition

no code implementations3 May 2022 Chung-Ching Lin, Kevin Lin, Linjie Li, Lijuan Wang, Zicheng Liu

The model design provides a natural mechanism for visual and semantic representations to be learned in a shared knowledge space, whereby it encourages the learned visual embedding to be discriminative and more semantically consistent.

Action Recognition Representation Learning +1

MLP Architectures for Vision-and-Language Modeling: An Empirical Study

1 code implementation8 Dec 2021 Yixin Nie, Linjie Li, Zhe Gan, Shuohang Wang, Chenguang Zhu, Michael Zeng, Zicheng Liu, Mohit Bansal, Lijuan Wang

Based on this, we ask an even bolder question: can we have an all-MLP architecture for VL modeling, where both VL fusion and the vision encoder are replaced with MLPs?

Language Modelling Visual Question Answering +1

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

1 code implementation25 Nov 2021 Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, Lijuan Wang

Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames as opposed to previous successes with sparsely sampled video frames for video-and-language understanding tasks (e. g., video question answering).

Frame Question Answering +3

VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling

1 code implementation24 Nov 2021 Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu

Further, unlike previous studies that found pre-training tasks on video inputs (e. g., masked frame modeling) not very effective, we design a new pre-training task, Masked Visual-token Modeling (MVM), for better video modeling.

Frame Question Answering +4

Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models

no code implementations ICCV 2021 Linjie Li, Jie Lei, Zhe Gan, Jingjing Liu

We hope our Adversarial VQA dataset can shed new light on robustness study in the community and serve as a valuable benchmark for future work.

Data Augmentation Question Answering +2

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

1 code implementation CVPR 2021 Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, Jingjing Liu

Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.

Ranked #4 on Visual Question Answering on MSRVTT-QA (using extra training data)

Question Answering Text to Video Retrieval +3

A Closer Look at the Robustness of Vision-and-Language Pre-trained Models

no code implementations15 Dec 2020 Linjie Li, Zhe Gan, Jingjing Liu

Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have propelled the state of the art in vision-and-language (V+L) research to a new level.

Graph Optimal Transport for Cross-Domain Alignment

1 code implementation ICML 2020 Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, Jingjing Liu

In GOT, cross-domain alignment is formulated as a graph matching problem, by representing entities into a dynamically-constructed graph.

Graph Matching Image Captioning +5

Meta Module Network for Compositional Visual Reasoning

1 code implementation8 Oct 2019 Wenhu Chen, Zhe Gan, Linjie Li, Yu Cheng, William Wang, Jingjing Liu

To design a more powerful NMN architecture for practical use, we propose Meta Module Network (MMN) centered on a novel meta module, which can take in function recipes and morph into diverse instance modules dynamically.

Visual Reasoning

UNITER: Learning UNiversal Image-TExt Representations

no code implementations25 Sep 2019 Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are jointly processed for visual and textual understanding.

Language Modelling Masked Language Modeling +7

UNITER: UNiversal Image-TExt Representation Learning

5 code implementations ECCV 2020 Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i. e., masked language/region modeling is conditioned on full observation of image/text).

Language Modelling Masked Language Modeling +9

Relation-Aware Graph Attention Network for Visual Question Answering

1 code implementation ICCV 2019 Linjie Li, Zhe Gan, Yu Cheng, Jingjing Liu

In order to answer semantically-complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects.

Graph Attention Question Answering +2

Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog

no code implementations ACL 2019 Zhe Gan, Yu Cheng, Ahmed El Kholy, Linjie Li, Jingjing Liu, Jianfeng Gao

This paper presents a new model for visual dialog, Recurrent Dual Attention Network (ReDAN), using multi-step reasoning to answer a series of questions about an image.

Question Answering Visual Dialog

Learning to see people like people

no code implementations5 May 2017 Amanda Song, Linjie Li, Chad Atalla, Garrison Cottrell

Humans make complex inferences on faces, ranging from objective properties (gender, ethnicity, expression, age, identity, etc) to subjective judgments (facial attractiveness, trustworthiness, sociability, friendliness, etc).

Cannot find the paper you are looking for? You can Submit a new open access paper.