Search Results for author: Xudong Lin

Found 37 papers, 16 papers with code

Coreference by Appearance: Visually Grounded Event Coreference Resolution

no code implementations CRAC (ACL) 2021 Liming Wang, Shengyu Feng, Xudong Lin, Manling Li, Heng Ji, Shih-Fu Chang

Event coreference resolution is critical to understanding events in the growing volume of online news that spans multiple modalities, including text, video, and speech.

coreference-resolution Event Coreference Resolution +2

BLINK: Multimodal Large Language Models Can See but Not Perceive

no code implementations 18 Apr 2024 Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna

We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations.

SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos

no code implementations 3 Mar 2024 Yulei Niu, Wenliang Guo, Long Chen, Xudong Lin, Shih-Fu Chang

We study the problem of procedure planning in instructional videos, which aims to make a goal-oriented sequence of action steps given partial visual state observations.

Contrastive Learning

Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

no code implementations 10 Jan 2024 Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang

In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions.

Multimodal Reasoning

InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

no code implementations 20 Nov 2023 Xiaotian Han, Quanzeng You, Yongfei Liu, Wentao Chen, Huangjie Zheng, Khalil Mrini, Xudong Lin, Yiqi Wang, Bohan Zhai, Jianbo Yuan, Heng Wang, Hongxia Yang

To mitigate this issue, we manually curate a benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks.

Non-Sequential Graph Script Induction via Multimedia Grounding

1 code implementation 27 May 2023 Yu Zhou, Sha Li, Manling Li, Xudong Lin, Shih-Fu Chang, Mohit Bansal, Heng Ji

To automate the induction of such graph scripts for given tasks, we propose to take advantage of loosely aligned videos of people performing the tasks.

Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

no code implementations 7 Apr 2023 Hung-Ting Su, Yulei Niu, Xudong Lin, Winston H. Hsu, Shih-Fu Chang

Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video.

Question Answering Question Generation +3

Supervised Masked Knowledge Distillation for Few-Shot Transformers

1 code implementation CVPR 2023 Han Lin, Guangxing Han, Jiawei Ma, Shiyuan Huang, Xudong Lin, Shih-Fu Chang

Vision Transformers (ViTs) have emerged to achieve impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features.

Few-Shot Learning Inductive Bias +1

In Defense of Structural Symbolic Representation for Video Event-Relation Prediction

no code implementations 6 Jan 2023 Andrew Lu, Xudong Lin, Yulei Niu, Shih-Fu Chang

Understanding event relationships in videos requires a model to understand the underlying structures of events (i.e., the event type, the associated argument roles, and corresponding entities) and factual knowledge for reasoning.

Relation

TempCLR: Temporal Alignment Representation with Contrastive Learning

1 code implementation 28 Dec 2022 Yuncong Yang, Jiawei Ma, Shiyuan Huang, Long Chen, Xudong Lin, Guangxing Han, Shih-Fu Chang

For long videos, given a paragraph of description where the sentences describe different segments of the video, by matching all sentence-clip pairs, the paragraph and the full video are aligned implicitly.

Contrastive Learning Dynamic Time Warping +7
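
As a rough illustration of the sentence-clip alignment idea in the TempCLR entry above, the sketch below scores a monotonic alignment between sentence and clip embeddings with dynamic time warping. The embeddings, shapes, and function name are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def dtw_align(sim):
    """Dynamic-time-warping alignment over a similarity matrix
    sim[i, j] = cos(sentence_i, clip_j); returns the maximum
    accumulated similarity of a monotonic alignment."""
    n, m = sim.shape
    acc = np.full((n + 1, m + 1), -np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # extend the best of: previous sentence, previous clip, or both
            acc[i, j] = sim[i - 1, j - 1] + max(acc[i - 1, j - 1],
                                                acc[i - 1, j],
                                                acc[i, j - 1])
    return acc[n, m]

# illustrative usage: 4 sentences, 6 clips, 128-d embeddings
rng = np.random.default_rng(0)
sents = rng.normal(size=(4, 128))
clips = rng.normal(size=(6, 128))
sents /= np.linalg.norm(sents, axis=1, keepdims=True)
clips /= np.linalg.norm(clips, axis=1, keepdims=True)
score = dtw_align(sents @ clips.T)   # paragraph-video alignment score
```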

Video Event Extraction via Tracking Visual States of Arguments

no code implementations 3 Nov 2022 Guang Yang, Manling Li, Jiajie Zhang, Xudong Lin, Shih-Fu Chang, Heng Ji

Video event extraction aims to detect salient events from a video and identify the arguments for each event as well as their semantic roles.

Event Extraction

Weakly-Supervised Temporal Article Grounding

1 code implementation 22 Oct 2022 Long Chen, Yulei Niu, Brian Chen, Xudong Lin, Guangxing Han, Christopher Thomas, Hammad Ayyubi, Heng Ji, Shih-Fu Chang

Specifically, given an article and a relevant video, WSAG aims to localize all "groundable" sentences to the video, and these sentences are possibly at different semantic scales.

Natural Language Queries Sentence +1

Learning to Decompose Visual Features with Latent Textual Prompts

no code implementations 9 Oct 2022 Feng Wang, Manling Li, Xudong Lin, Hairong Lv, Alexander G. Schwing, Heng Ji

Recent advances in pre-training vision-language models like CLIP have shown great potential in learning transferable visual representations.

Retrieval

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

1 code implementation CVPR 2023 Xudong Lin, Simran Tiwari, Shiyuan Huang, Manling Li, Mike Zheng Shou, Heng Ji, Shih-Fu Chang

We surprisingly find that discrete text tokens coupled with a pretrained contrastive text model yield the best performance, which can even outperform the state of the art on the iVQA and How2QA datasets without additional training on millions of video-text pairs.

Retrieval Sentence +2
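
The finding above (discrete text tokens fed to a pretrained contrastive text model) can be illustrated roughly as follows. The stand-in encoders, vocabulary, and shapes are assumptions made for the sketch, not the paper's code or a real CLIP API.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64

# Stand-in encoders for a contrastive model's image and text towers
# (purely illustrative random projections, not a real pretrained model).
def encode_image(frames):                 # (T, H*W*3) -> (T, D)
    proj = rng.normal(size=(frames.shape[1], D))
    return frames @ proj

def encode_text(strings):                 # list[str] -> (V, D)
    return np.stack([rng.normal(size=D) for _ in strings])

def video_to_text_tokens(frames, vocabulary):
    """Map each frame to its nearest vocabulary word, turning the video
    channel into discrete text tokens a text-only model can consume."""
    f = encode_image(frames)
    v = encode_text(vocabulary)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    nearest = (f @ v.T).argmax(axis=1)
    return " ".join(vocabulary[i] for i in nearest)

frames = rng.normal(size=(8, 32 * 32 * 3))          # 8 flattened frames
vocab = ["cooking", "driving", "running", "talking"]
print(video_to_text_tokens(frames, vocab))
# The resulting token string would then be scored against queries by the
# pretrained contrastive *text* model alone.
```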

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

1 code implementation 22 May 2022 Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, ZiYi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang, Mohit Bansal, Heng Ji

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction.

Attribute Automatic Speech Recognition +6
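
A minimal sketch of the few-shot prompting idea suggested by the entry above: frame-level image descriptors are serialized into a text prompt that a frozen language model completes. The captions, template, and example data below are hypothetical placeholders, not the paper's pipeline.

```python
def build_prompt(examples, frame_captions, question):
    """Assemble a few-shot prompt from frame captions and Q/A examples."""
    lines = []
    for ex in examples:                                # few-shot examples
        lines.append("Frames: " + "; ".join(ex["captions"]))
        lines.append(f"Question: {ex['question']}\nAnswer: {ex['answer']}\n")
    lines.append("Frames: " + "; ".join(frame_captions))
    lines.append(f"Question: {question}\nAnswer:")
    return "\n".join(lines)

prompt = build_prompt(
    examples=[{"captions": ["a person chops onions", "a pan on a stove"],
               "question": "What is the person cooking?",
               "answer": "a stir-fry"}],
    frame_captions=["a man holds a surfboard", "waves crash on a beach"],
    question="What will the man do next?")
# `prompt` would be sent to a frozen language model via whatever API is available.
```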

Revitalize Region Feature for Democratizing Video-Language Pre-training of Retrieval

2 code implementations 15 Mar 2022 Guanyu Cai, Yixiao Ge, Binjie Zhang, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, XiaoHu Qie, Jianping Wu, Mike Zheng Shou

Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language retrieval.

Question Answering Retrieval +4

All in One: Exploring Unified Video-Language Pre-training

1 code implementation CVPR 2023 Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we introduce, for the first time, an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture.

Ranked #6 on TGIF-Transition on TGIF-QA (using extra training data)

Language Modelling Multiple-choice +10
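
To make the "unified backbone" idea above concrete, here is a toy sketch in which flattened 16x16 RGB video patches and text tokens are embedded and processed by a single shared Transformer encoder. The dimensions, vocabulary size, and layer counts are arbitrary assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class UnifiedVideoTextEncoder(nn.Module):
    """Toy unified backbone: video patch embeddings and text token
    embeddings are concatenated and processed by one shared Transformer."""
    def __init__(self, vocab=30522, dim=256, layers=4, heads=8):
        super().__init__()
        self.patch_proj = nn.Linear(3 * 16 * 16, dim)     # flattened patches
        self.tok_emb = nn.Embedding(vocab, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, patches, token_ids):    # (B, Nv, 768), (B, Nt)
        video = self.patch_proj(patches)
        text = self.tok_emb(token_ids)
        joint = torch.cat([video, text], dim=1)
        return self.encoder(joint)            # (B, Nv + Nt, dim)

model = UnifiedVideoTextEncoder()
out = model(torch.randn(2, 32, 768), torch.randint(0, 30522, (2, 16)))
print(out.shape)   # torch.Size([2, 48, 256])
```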

Learning To Recognize Procedural Activities with Distant Supervision

1 code implementation CVPR 2022 Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani

In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes.

Action Classification Language Modelling +1

CLIP-Event: Connecting Text and Images with Event Structures

1 code implementation CVPR 2022 Manling Li, Ruochen Xu, Shuohang Wang, Luowei Zhou, Xudong Lin, Chenguang Zhu, Michael Zeng, Heng Ji, Shih-Fu Chang

Vision-language (V+L) pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text.

Contrastive Learning Event Extraction +2

MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

2 code implementations 20 Dec 2021 Revanth Gangi Reddy, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, Mohit Bansal, Avirup Sil, Shih-Fu Chang, Alexander Schwing, Heng Ji

Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question.

Answer Generation Data Augmentation +2

Video-Text Pre-training with Learned Regions

1 code implementation 2 Dec 2021 Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng Wang, Xudong Lin, Guanyu Cai, Jinhui Tang

Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information.

Representation Learning Retrieval +2

Object-aware Video-language Pre-training for Retrieval

1 code implementation CVPR 2022 Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we present Object-aware Transformers, an object-centric approach that extends video-language transformer to incorporate object representations.

Object Retrieval +2

Flow-Distilled IP Two-Stream Networks for Compressed Video Action Recognition

no code implementations 10 Dec 2019 Shiyuan Huang, Xudong Lin, Svebor Karaman, Shih-Fu Chang

Recent works instead use modern compressed video modalities as an alternative to the RGB spatial stream and improve the inference speed by orders of magnitude.

Action Recognition Optical Flow Estimation +3

Towards Train-Test Consistency for Semi-supervised Temporal Action Localization

no code implementations 24 Oct 2019 Xudong Lin, Zheng Shou, Shih-Fu Chang

The inconsistent strategy makes it hard to explicitly supervise the action localization model with temporal boundary annotations at training time.

Multiple Instance Learning Video Classification +2

Context-Gated Convolution

1 code implementation ECCV 2020 Xudong Lin, Lin Ma, Wei Liu, Shih-Fu Chang

As such, being aware of the global context, the modulated convolution kernel of our proposed CGC can better extract representative local patterns and compose discriminative features.

Ranked #61 on Image Classification on ObjectNet (using extra training data)

Action Recognition Image Classification +1
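
A simplified sketch of the context-gating idea described above: a gate computed from globally pooled context scales the convolution per output channel (for a bias-free convolution this is equivalent to scaling the kernel). This is a deliberate simplification for illustration, not the paper's full kernel-modulation scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextModulatedConv2d(nn.Module):
    """Illustrative context-gated convolution: a shared kernel whose
    effect is modulated per output channel by the global context."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.gate = nn.Sequential(
            nn.Linear(in_ch, in_ch // 2), nn.ReLU(),
            nn.Linear(in_ch // 2, out_ch), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        context = x.mean(dim=(2, 3))            # global average pooling
        g = self.gate(context)                  # (B, out_ch) gates
        y = F.conv2d(x, self.weight, padding=1) # shared kernel
        return y * g[:, :, None, None]          # context-dependent scaling

x = torch.randn(2, 16, 32, 32)
print(ContextModulatedConv2d(16, 32)(x).shape)  # torch.Size([2, 32, 32, 32])
```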

Unsupervised Rank-Preserving Hashing for Large-Scale Image Retrieval

no code implementations 4 Mar 2019 Svebor Karaman, Xudong Lin, Xuefeng Hu, Shih-Fu Chang

We propose an unsupervised hashing method which aims to produce binary codes that preserve the ranking induced by a real-valued representation.

Image Retrieval Re-Ranking +1
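
For orientation, a naive baseline for producing binary codes and ranking by Hamming distance is sketched below. The random-projection binarization is an assumption made for illustration; it does not implement the paper's rank-preserving objective.

```python
import numpy as np

def hash_codes(feats, n_bits=64, seed=0):
    """Sign-binarize random projections of real-valued features
    (a generic baseline, not the paper's learned hashing)."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(feats.shape[1], n_bits))
    return (feats @ proj > 0).astype(np.uint8)

def hamming_ranking(query_code, db_codes):
    """Rank database items by Hamming distance to the query code."""
    return np.argsort((query_code != db_codes).sum(axis=1))

feats = np.random.default_rng(1).normal(size=(1000, 256))
codes = hash_codes(feats)
order = hamming_ranking(codes[0], codes)  # ideally close to the ranking
print(order[:5])                          # induced by the real features
```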

Deep Variational Metric Learning

no code implementations ECCV 2018 Xudong Lin, Yueqi Duan, Qiyuan Dong, Jiwen Lu, Jie Zhou

Deep metric learning has been extensively explored recently, which trains a deep neural network to produce discriminative embedding features.

Metric Learning

GraphBit: Bitwise Interaction Mining via Deep Reinforcement Learning

no code implementations CVPR 2018 Yueqi Duan, Ziwei Wang, Jiwen Lu, Xudong Lin, Jie Zhou

Specifically, we design a deep reinforcement learning model to learn the structure of the graph for bitwise interaction mining, reducing the uncertainty of binary codes by maximizing the mutual information with inputs and related bits, so that the ambiguous bits receive additional instruction from the graph for confident binarization.

Binarization reinforcement-learning +2

Deep Adversarial Metric Learning

no code implementations CVPR 2018 Yueqi Duan, Wenzhao Zheng, Xudong Lin, Jiwen Lu, Jie Zhou

Learning an effective distance metric between image pairs plays an important role in visual analysis, where the training procedure largely relies on hard negative samples.

Metric Learning
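
Since the entry above highlights the reliance on hard negative samples, here is a generic in-batch hardest-negative triplet loss for reference. It is a standard baseline, not the adversarial negative generation proposed in the paper.

```python
import torch
import torch.nn.functional as F

def hard_negative_triplet_loss(emb, labels, margin=0.2):
    """In-batch triplet loss with hardest-positive / hardest-negative mining."""
    emb = F.normalize(emb, dim=1)
    dist = torch.cdist(emb, emb)                     # (B, B) L2 distances
    same = labels[:, None].eq(labels[None, :])
    pos = (dist * same.float()).max(dim=1).values    # hardest positive
    neg = dist.masked_fill(same, float("inf")).min(dim=1).values  # hardest negative
    return F.relu(pos - neg + margin).mean()

emb = torch.randn(16, 128, requires_grad=True)
labels = torch.randint(0, 4, (16,))
loss = hard_negative_triplet_loss(emb, labels)
loss.backward()
```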
