Search Results for author: Lei Ji

Found 30 papers, 15 papers with code

Exploring Diffusion Time-steps for Unsupervised Representation Learning

1 code implementation21 Jan 2024 Zhongqi Yue, Jiankun Wang, Qianru Sun, Lei Ji, Eric I-Chao Chang, Hanwang Zhang

Representation learning is all about discovering the hidden modular attributes that generate the data faithfully.

Attribute counterfactual +3

Voila-A: Aligning Vision-Language Models with User's Gaze Attention

no code implementations22 Dec 2023 Kun Yan, Lei Ji, Zeyu Wang, Yuntao Wang, Nan Duan, Shuai Ma

In this paper, we introduce gaze information, feasibly collected by AR or VR devices, as a proxy for human attention to guide VLMs. We propose a novel approach, Voila-A, for gaze alignment that enhances the interpretability and effectiveness of these models in real-world applications.
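
The snippet below is a minimal, hypothetical sketch of the general idea of gaze-guided vision features: a gaze heatmap (as collected by an AR/VR eye tracker) is pooled to the ViT patch grid and used to softly re-weight patch tokens. Function names, shapes, and the blending scheme are illustrative assumptions, not Voila-A's actual alignment mechanism.

```python
# Hypothetical sketch of gaze-conditioned feature weighting. Voila-A's actual
# alignment method is more involved; this only illustrates using a gaze
# heatmap as a proxy for attention over image patches.
import torch
import torch.nn.functional as F

def gaze_weight_patches(patch_feats, gaze_map, grid=16, strength=1.0):
    """patch_feats: (B, grid*grid, D) ViT patch tokens.
    gaze_map: (B, H, W) heatmap from an eye tracker, values >= 0."""
    # Downsample the gaze heatmap to the patch grid and normalize it.
    g = F.adaptive_avg_pool2d(gaze_map.unsqueeze(1), (grid, grid))  # (B,1,g,g)
    g = g.flatten(2).transpose(1, 2)                                # (B, g*g, 1)
    g = g / (g.sum(dim=1, keepdim=True) + 1e-6)
    # Blend uniform attention with gaze-driven attention.
    uniform = torch.full_like(g, 1.0 / g.shape[1])
    w = (1 - strength) * uniform + strength * g
    return patch_feats * (1.0 + w)  # softly up-weight fixated patches

patch_feats = torch.randn(2, 256, 768)
gaze_map = torch.rand(2, 224, 224)
print(gaze_weight_patches(patch_feats, gaze_map).shape)  # (2, 256, 768)
```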

EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images

2 code implementations NeurIPS 2023 Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, JungWoo Oh, Lei Ji, Eric I-Chao Chang, Tackeun Kim, Edward Choi

To develop our dataset, we first construct two uni-modal resources: 1) The MIMIC-CXR-VQA dataset, our newly created medical visual question answering (VQA) benchmark, specifically designed to augment the imaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of a previously established table-based EHR QA dataset.

Decision Making Medical Visual Question Answering +2

GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

1 code implementation27 Jun 2023 Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou

Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and further fine-tune the model on annotated data.

Natural Language Queries

TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

no code implementations29 Mar 2023 Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, Nan Duan

On the other hand, many existing models and systems (symbolic or neural) already perform certain domain-specific tasks very well.

Code Generation Common Sense Reasoning +1
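
As a toy illustration of the paper's central loop, the sketch below registers a few domain-specific APIs and routes a request to one of them; the router is a stub standing in for the LLM that TaskMatrix.AI would prompt with API documentation. All names here (`API_REGISTRY`, `weather.lookup`, `math.eval`) are hypothetical.

```python
# Toy sketch of the TaskMatrix.AI idea: a foundation model routes a user
# request to one of many registered domain-specific APIs.
from typing import Callable, Dict

API_REGISTRY: Dict[str, Callable[[str], str]] = {}

def register(name: str):
    def deco(fn):
        API_REGISTRY[name] = fn
        return fn
    return deco

@register("weather.lookup")
def weather(query: str) -> str:          # hypothetical domain API
    return f"(weather service answer for: {query})"

@register("math.eval")
def math_eval(query: str) -> str:        # hypothetical domain API
    return str(eval(query, {"__builtins__": {}}, {}))

def route(request: str) -> str:
    """Stand-in for the LLM router: picks an API by registered name.
    A real system would prompt the model with each API's documentation."""
    name = "math.eval" if any(c.isdigit() for c in request) else "weather.lookup"
    return API_REGISTRY[name](request)

print(route("2 * (3 + 4)"))               # -> 14
print(route("rain tomorrow in Berlin?"))  # -> weather stub
```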

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

1 code implementation CVPR 2023 Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou

To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must.

Question Answering Video Question Answering +2
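
A simplified sketch of the iterative selection idea behind MIST follows: score candidate frames (then regions) against the question embedding and keep only the top-k at each granularity. Shapes and the single-pass structure are assumptions for illustration; the paper's transformer iterates this selection across layers.

```python
# Simplified MIST-style selection: rank video segments against the question,
# keep the top-k, then repeat at a finer (region) granularity.
import torch

def select_topk(query, feats, k):
    """query: (B, D); feats: (B, N, D). Returns the k best-matching items."""
    scores = torch.einsum("bd,bnd->bn", query, feats)      # similarity per item
    idx = scores.topk(k, dim=1).indices                    # (B, k)
    return torch.gather(feats, 1, idx.unsqueeze(-1).expand(-1, -1, feats.size(-1)))

B, T, R, D = 2, 32, 16, 256
q = torch.randn(B, D)                 # question embedding
frames = torch.randn(B, T, D)         # per-frame features of a long video
regions = torch.randn(B, T * R, D)    # region features (flattened)

top_frames = select_topk(q, frames, k=8)       # coarse temporal selection
top_regions = select_topk(q, regions, k=32)    # finer spatial selection
print(top_frames.shape, top_regions.shape)     # (2, 8, 256) (2, 32, 256)
```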

HORIZON: High-Resolution Semantically Controlled Panorama Synthesis

no code implementations10 Oct 2022 Kun Yan, Lei Ji, Chenfei Wu, Jian Liang, Ming Zhou, Nan Duan, Shuai Ma

Panorama synthesis endeavors to craft captivating 360-degree visual landscapes, immersing users in the heart of virtual worlds.

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

1 code implementation22 Sep 2022 Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan

This paper tackles an emerging and challenging problem of long video temporal grounding (VTG) that localizes video moments related to a natural language (NL) query.

Contrastive Learning Video Grounding
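
The sketch below illustrates a coarse-to-fine pipeline in the spirit of CONE: slice a long video into overlapping windows, rank the windows against the query, and localize frames only inside the top-ranked windows. Cosine similarity stands in for the paper's learned contrastive and grounding heads, and all shapes are illustrative.

```python
# Rough coarse-to-fine grounding sketch: window proposal, window ranking,
# then per-frame scoring inside only the best windows.
import torch
import torch.nn.functional as F

def make_windows(num_frames, win, stride):
    return [(s, min(s + win, num_frames)) for s in range(0, num_frames, stride)]

def coarse_to_fine(frame_feats, query_feat, win=64, stride=32, top_k=2):
    """frame_feats: (T, D); query_feat: (D,). Returns (frame index, score) pairs."""
    windows = make_windows(frame_feats.size(0), win, stride)
    # Coarse stage: one pooled feature per window, ranked by cosine similarity.
    pooled = torch.stack([frame_feats[s:e].mean(0) for s, e in windows])
    scores = F.cosine_similarity(pooled, query_feat.unsqueeze(0), dim=1)
    keep = scores.topk(min(top_k, len(windows))).indices.tolist()
    # Fine stage: inside each kept window, score individual frames.
    results = []
    for i in keep:
        s, e = windows[i]
        fs = F.cosine_similarity(frame_feats[s:e], query_feat.unsqueeze(0), dim=1)
        results.append((s + int(fs.argmax()), float(fs.max())))
    return results

feats = torch.randn(300, 128)   # 300 frames of a long video
query = torch.randn(128)        # NL query embedding
print(coarse_to_fine(feats, query))
```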

ScaleVLAD: Improving Multimodal Sentiment Analysis via Multi-Scale Fusion of Locally Descriptors

no code implementations2 Dec 2021 Huaishao Luo, Lei Ji, Yanyong Huang, Bin Wang, Shenggong Ji, Tianrui Li

This paper proposes a fusion model named ScaleVLAD to gather multi-Scale representation from text, video, and audio with shared Vectors of Locally Aggregated Descriptors to improve unaligned multimodal sentiment analysis.

Multimodal Sentiment Analysis
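
Below is a compact sketch of the core aggregation step, a NetVLAD-style soft assignment of local descriptors to shared cluster vectors. The multi-scale fusion and per-modality encoders of ScaleVLAD are omitted; dimensions and the module name `SoftVLAD` are illustrative.

```python
# NetVLAD-like pooling with shared cluster centers, the operation at the
# heart of ScaleVLAD's fusion (shown here for a single scale and modality).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftVLAD(nn.Module):
    def __init__(self, dim, num_clusters):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                      # x: (B, N, D) local descriptors
        # Soft-assign each descriptor to every shared cluster center.
        assign = F.softmax(x @ self.centers.t(), dim=-1)       # (B, N, K)
        # Aggregate residuals between descriptors and centers.
        residual = x.unsqueeze(2) - self.centers               # (B, N, K, D)
        vlad = (assign.unsqueeze(-1) * residual).sum(dim=1)    # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                       # intra-normalize
        return F.normalize(vlad.flatten(1), dim=-1)            # (B, K*D)

pool = SoftVLAD(dim=64, num_clusters=8)
text_tokens = torch.randn(4, 20, 64)       # e.g. token-level text features
print(pool(text_tokens).shape)             # torch.Size([4, 512])
```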

Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering

no code implementations NeurIPS 2021 Weijiang Yu, Haoteng Zheng, Mengfei Li, Lei Ji, Lijun Wu, Nong Xiao, Nan Duan

To incorporate the interdependent knowledge between contextual clips into network inference, we propose a Siamese Sampling and Reasoning (SiaSamRea) approach, which consists of a siamese sampling mechanism to generate sparse and similar clips (i.e., siamese clips) from the same video, and a novel reasoning strategy for integrating the interdependent knowledge between contextual clips into the network.

Multimodal Reasoning Question Answering +1

NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

1 code implementation24 Nov 2021 Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan

To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively.

Text-to-Image Generation Text-to-Video Generation +2
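
To make the 1D/2D/3D framing concrete, the sketch below lifts text, image, and video token ids onto a single (time, height, width) grid, which is what lets one 3D encoder-decoder consume all three modalities. Token ids are random stand-ins for BPE and VQ-VAE codes, and the helper is an illustrative assumption rather than the paper's code.

```python
# Sketch of NÜWA's unifying idea: text as 1D, images as 2D, and videos as 3D
# data, all represented on one (T, H, W) token grid.
import torch

def to_3d_grid(tokens, kind):
    """Lift token ids to a (T, H, W) grid as in the paper's 3D framing."""
    if kind == "text":                 # 1D: sequence on the width axis
        return tokens.view(1, 1, -1)
    if kind == "image":                # 2D: one temporal slice
        h = int(tokens.numel() ** 0.5)
        return tokens.view(1, h, h)
    if kind == "video":                # 3D: frames x patch grid
        t = tokens.shape[0]
        h = int(tokens.shape[1] ** 0.5)
        return tokens.view(t, h, h)
    raise ValueError(kind)

text = torch.randint(0, 1000, (77,))           # BPE-like text ids
image = torch.randint(0, 8192, (16 * 16,))     # VQ image codes
video = torch.randint(0, 8192, (10, 16 * 16))  # VQ codes for 10 frames

for name, (tok, kind) in {"text": (text, "text"), "image": (image, "image"),
                          "video": (video, "video")}.items():
    print(name, to_3d_grid(tok, kind).shape)
# text -> (1, 1, 77); image -> (1, 16, 16); video -> (10, 16, 16)
```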

Hybrid Reasoning Network for Video-based Commonsense Captioning

1 code implementation5 Aug 2021 Weijiang Yu, Jian Liang, Lei Ji, Lu Li, Yuejian Fang, Nong Xiao, Nan Duan

Firstly, we develop multi-commonsense learning for semantic-level reasoning by jointly training different commonsense types in a unified network, which encourages the interaction between the clues of multiple commonsense descriptions, event-wise captions and videos.

Attribute

Control Image Captioning Spatially and Temporally

no code implementations ACL 2021 Kun Yan, Lei Ji, Huaishao Luo, Ming Zhou, Nan Duan, Shuai Ma

Moreover, the controllability and explainability of LoopCAG are validated by analyzing spatial and temporal sensitivity during the generation process.

Contrastive Learning Image Captioning +1

Hierarchical Context-aware Network for Dense Video Event Captioning

1 code implementation ACL 2021 Lei Ji, Xianglin Guo, Haoyang Huang, Xilin Chen

Dense video event captioning aims to generate a sequence of descriptive captions for each event in a long untrimmed video.

GEM: A General Evaluation Benchmark for Multimodal Tasks

1 code implementation Findings (ACL) 2021 Lin Su, Nan Duan, Edward Cui, Lei Ji, Chenfei Wu, Huaishao Luo, Yongfei Liu, Ming Zhong, Taroon Bharti, Arun Sacheti

Compared with existing multimodal datasets such as MSCOCO and Flickr30K for image-language tasks, and YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language and video-language tasks at the same time, but is also labeled in multiple languages.

GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions

1 code implementation30 Apr 2021 Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, Nan Duan

Generating videos from text is a challenging task due to its high computational requirements for training and infinite possible answers for evaluation.

Ranked #16 on Text-to-Video Generation on MSR-VTT (CLIPSIM metric)

Text-to-Video Generation Video Generation

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

5 code implementations18 Apr 2021 Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li

In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.

Retrieval Text Retrieval +4
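
Among the similarity calculators the paper studies, the simplest is parameter-free: mean-pool per-frame CLIP features over time and compare with the text feature by cosine similarity. The sketch below shows that variant, with random tensors standing in for CLIP outputs.

```python
# Parameter-free CLIP4Clip-style similarity: mean-pool frame embeddings,
# then score videos against captions by cosine similarity.
import torch
import torch.nn.functional as F

def video_text_similarity(frame_embs, text_embs):
    """frame_embs: (B, T, D) per-frame CLIP image features;
    text_embs: (B, D) CLIP text features. Returns a (B, B) similarity matrix."""
    video = F.normalize(frame_embs.mean(dim=1), dim=-1)   # mean-pool over time
    text = F.normalize(text_embs, dim=-1)
    return video @ text.t()                               # cosine similarities

frames = torch.randn(8, 12, 512)    # 8 videos x 12 sampled frames
texts = torch.randn(8, 512)         # 8 captions
sim = video_text_similarity(frames, texts)
print(sim.shape, sim.diagonal().mean())  # matched pairs lie on the diagonal
```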

XGPT: Cross-modal Generative Pre-Training for Image Captioning

no code implementations3 Mar 2020 Qiaolin Xia, Haoyang Huang, Nan Duan, Dong-dong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou

While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly.

Data Augmentation Denoising +7

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

2 code implementations15 Feb 2020 Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, Ming Zhou

However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks.

Ranked #2 on Action Segmentation on COIN (using extra training data)

Action Segmentation Language Modelling +2

Knowledge Aware Semantic Concept Expansion for Image-Text Matching

no code implementations International Joint Conference on Artificial Intelligence (IJCAI) 2019 Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, Nan Duan

In this paper, we develop a Scene Concept Graph (SCG) by aggregating image scene graphs and extracting frequently co-occurred concept pairs as scene common-sense knowledge.

Common Sense Reasoning Content-Based Image Retrieval +3
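
A small sketch of the Scene Concept Graph idea follows: count concept co-occurrences across image scene graphs, then expand a query concept with its most frequent neighbors. The toy data and `expand` helper are illustrative; the paper builds the graph at scale and integrates it into the matching model.

```python
# Toy Scene Concept Graph: co-occurrence counts over per-image concept sets,
# used to expand a concept with its common-sense neighbors.
from collections import Counter
from itertools import combinations

scene_graphs = [                      # toy per-image concept sets
    {"dog", "frisbee", "grass"},
    {"dog", "leash", "person"},
    {"person", "frisbee", "park"},
]

cooccur = Counter()
for concepts in scene_graphs:
    for a, b in combinations(sorted(concepts), 2):
        cooccur[(a, b)] += 1

def expand(concept, k=3):
    """Return the k concepts that most often co-occur with `concept`."""
    neighbors = Counter()
    for (a, b), n in cooccur.items():
        if a == concept:
            neighbors[b] += n
        elif b == concept:
            neighbors[a] += n
    return [c for c, _ in neighbors.most_common(k)]

print(expand("dog"))   # e.g. ['frisbee', 'grass', 'leash']
```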

Dense Procedure Captioning in Narrated Instructional Videos

no code implementations ACL 2019 Botian Shi, Lei Ji, Yaobo Liang, Nan Duan, Peng Chen, Zhendong Niu, Ming Zhou

Understanding narrated instructional videos is important for both research and real-world web applications.

Dense Captioning
