Search Results for author: Zineng Tang

Found 10 papers, 9 papers with code

CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

no code implementations • 30 Nov 2023 • Zineng Tang, ZiYi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, Mohit Bansal

We present CoDi-2, a versatile and interactive Multimodal Large Language Model (MLLM) that can follow complex multimodal interleaved instructions, conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any input-output modality paradigm.

Image Generation, In-Context Learning, +3


Any-to-Any Generation via Composable Diffusion

1 code implementation • NeurIPS 2023 • Zineng Tang, ZiYi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal

We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities.

Ranked #7 on Audio Generation on AudioCaps

Audio Generation


Paxion: Patching Action Knowledge in Video-Language Foundation Models

1 code implementation • NeurIPS 2023 • Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, Heng Ji

Action knowledge involves the understanding of textual, visual, and temporal aspects of actions.

Ranked #18 on Video Question Answering on NExT-QA (using extra training data)

Action Understanding, Object Recognition, +1


Unifying Vision, Text, and Layout for Universal Document Processing

2 code implementations • CVPR 2023 • Zineng Tang, ZiYi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, Mohit Bansal

UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation.

Ranked #5 on Visual Question Answering (VQA) on InfographicVQA (using extra training data)

Document Understanding, Image Reconstruction, +1


Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

1 code implementation • 21 Nov 2022 • Zineng Tang, Jaemin Cho, Jie Lei, Mohit Bansal

We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.

Cross-Modal Retrieval, Language Modelling, +1


TVLT: Textless Vision-Language Transformer

2 code implementations • 28 Sep 2022 • Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal

In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR).

Automatic Speech Recognition (ASR), Image Retrieval, +6


Continuous Language Generative Flow

1 code implementation • ACL 2021 • Zineng Tang, Shiyue Zhang, Hyounghun Kim, Mohit Bansal

Recent years have witnessed various types of generative models for natural language generation (NLG), especially RNNs or transformer based sequence-to-sequence models, as well as variational autoencoder (VAE) and generative adversarial network (GAN) based models.

Data Augmentation, Density Estimation, +9


VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

1 code implementation • NeurIPS 2021 • Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal

We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.

Image Retrieval, Knowledge Distillation, +6


DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization

1 code implementation • NAACL 2021 • Zineng Tang, Jie Lei, Mohit Bansal

Second, to alleviate the temporal misalignment issue, our method incorporates an entropy minimization-based constrained attention loss, to encourage the model to automatically focus on the correct caption from a pool of candidate ASR captions.

Question Answering, Retrieval, +4


Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

1 code implementation • ACL 2020 • Hyounghun Kim, Zineng Tang, Mohit Bansal

Moreover, our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information to the classifier.

Image Captioning, Multi-Label Classification, +3

