Search Results for author: Lei Ji

Found 31 papers, 16 papers with code

G-VOILA: Gaze-Facilitated Information Querying in Daily Scenarios

1 code implementation13 May 2024 Zeyu Wang, Yuanchun Shi, Yuntao Wang, Yuchen Yao, Kun Yan, YuHan Wang, Lei Ji, Xuhai Xu, Chun Yu

Modern information querying systems are progressively incorporating multimodal inputs like vision and audio.

Natural Language Queries

Exploring Diffusion Time-steps for Unsupervised Representation Learning

1 code implementation21 Jan 2024 Zhongqi Yue, Jiankun Wang, Qianru Sun, Lei Ji, Eric I-Chao Chang, Hanwang Zhang

Representation learning is all about discovering the hidden modular attributes that generate the data faithfully.

Attribute counterfactual +3

Voila-A: Aligning Vision-Language Models with User's Gaze Attention

no code implementations22 Dec 2023 Kun Yan, Lei Ji, Zeyu Wang, Yuntao Wang, Nan Duan, Shuai Ma

In this paper, we introduce gaze information, feasibly collected by AR or VR devices, as a proxy for human attention to guide VLMs and propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.

EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images

3 code implementations NeurIPS 2023 Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, JungWoo Oh, Lei Ji, Eric I-Chao Chang, Tackeun Kim, Edward Choi

To develop our dataset, we first construct two uni-modal resources: 1) The MIMIC-CXR-VQA dataset, our newly created medical visual question answering (VQA) benchmark, specifically designed to augment the imaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of a previously established table-based EHR QA dataset.

Decision Making Medical Visual Question Answering +2

GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

1 code implementation27 Jun 2023 Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou

Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and further fine-tune the model on annotated data.

Natural Language Queries

TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

no code implementations29 Mar 2023 Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, Nan Duan

On the other hand, there are also many existing models and systems (symbolic-based or neural-based) that can do some domain-specific tasks very well.

Code Generation Common Sense Reasoning +1

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

1 code implementation CVPR 2023 Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou

To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must.

Question Answering Video Question Answering +2

HORIZON: High-Resolution Semantically Controlled Panorama Synthesis

no code implementations10 Oct 2022 Kun Yan, Lei Ji, Chenfei Wu, Jian Liang, Ming Zhou, Nan Duan, Shuai Ma

Panorama synthesis endeavors to craft captivating 360-degree visual landscapes, immersing users in the heart of virtual worlds.

Vocal Bursts Intensity Prediction

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

1 code implementation22 Sep 2022 Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan

This paper tackles an emerging and challenging problem of long video temporal grounding (VTG) that localizes video moments related to a natural language (NL) query.

Contrastive Learning Video Grounding

ScaleVLAD: Improving Multimodal Sentiment Analysis via Multi-Scale Fusion of Locally Descriptors

no code implementations2 Dec 2021 Huaishao Luo, Lei Ji, Yanyong Huang, Bin Wang, Shenggong Ji, Tianrui Li

This paper proposes a fusion model named ScaleVLAD to gather multi-Scale representation from text, video, and audio with shared Vectors of Locally Aggregated Descriptors to improve unaligned multimodal sentiment analysis.

Multimodal Sentiment Analysis
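The shared "Vectors of Locally Aggregated Descriptors" at the heart of ScaleVLAD can be illustrated with plain single-scale VLAD. The sketch below is not the paper's model: it hard-assigns made-up local features to nearest cluster centers and accumulates residuals, whereas ScaleVLAD does this softly at several scales with centers shared across text, video, and audio.

```python
import numpy as np

def vlad(descriptors, centers):
    """Vector of Locally Aggregated Descriptors: assign each local
    descriptor to its nearest cluster center and accumulate the
    residuals (descriptor - center) per cluster."""
    # Hard-assign each descriptor to the nearest center.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)                      # (n,)
    out = np.zeros_like(centers)                    # (k, d)
    for i, c in enumerate(assign):
        out[c] += descriptors[i] - centers[c]       # residual sum
    # Intra-normalize each cluster, then flatten and L2-normalize.
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    out = np.where(norms > 0, out / norms, out)
    v = out.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(1)
feats = rng.normal(size=(10, 4))    # e.g. per-frame features (toy sizes)
centers = rng.normal(size=(3, 4))   # k=3 shared cluster centers
print(vlad(feats, centers).shape)   # (12,) = k * d
```

The output is a fixed-length vector regardless of how many local descriptors the modality produced, which is what makes shared centers usable across unaligned text, video, and audio streams.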

Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering

no code implementations NeurIPS 2021 Weijiang Yu, Haoteng Zheng, Mengfei Li, Lei Ji, Lijun Wu, Nong Xiao, Nan Duan

To consider the interdependent knowledge between contextual clips into the network inference, we propose a Siamese Sampling and Reasoning (SiaSamRea) approach, which consists of a siamese sampling mechanism to generate sparse and similar clips (i.e., siamese clips) from the same video, and a novel reasoning strategy for integrating the interdependent knowledge between contextual clips into the network.

Multimodal Reasoning Question Answering +1

NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

1 code implementation24 Nov 2021 Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan

To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively.

Decoder Text-to-Image Generation +3
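The "1D text / 2D image / 3D video" unification described above can be pictured as putting every modality into one (T, H, W) token grid. This toy helper only shows that layout convention; the grid contents, the VQ-VAE visual tokens, and NÜWA's nearby-sparse 3D attention are all assumptions not sketched here.

```python
import numpy as np

def to_token_grid(x):
    """Reshape token-id arrays into a unified 3D (T, H, W) layout:
    text is 1 x 1 x L, an image patch grid is 1 x H x W, and a
    video is T x H x W."""
    if x.ndim == 1:          # text: sequence of L token ids
        return x.reshape(1, 1, -1)
    if x.ndim == 2:          # image: H x W grid of visual tokens
        return x[None, :, :]
    if x.ndim == 3:          # video: T frames of H x W grids
        return x
    raise ValueError("expected 1D, 2D, or 3D token array")

text = np.arange(6)                  # L=6 text tokens
image = np.arange(16).reshape(4, 4)  # 4x4 visual tokens
video = np.arange(32).reshape(2, 4, 4)
print(to_token_grid(text).shape, to_token_grid(image).shape,
      to_token_grid(video).shape)    # (1, 1, 6) (1, 4, 4) (2, 4, 4)
```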

Hybrid Reasoning Network for Video-based Commonsense Captioning

1 code implementation5 Aug 2021 Weijiang Yu, Jian Liang, Lei Ji, Lu Li, Yuejian Fang, Nong Xiao, Nan Duan

Firstly, we develop multi-commonsense learning for semantic-level reasoning by jointly training different commonsense types in a unified network, which encourages the interaction between the clues of multiple commonsense descriptions, event-wise captions and videos.

Attribute Decoder

Control Image Captioning Spatially and Temporally

no code implementations ACL 2021 Kun Yan, Lei Ji, Huaishao Luo, Ming Zhou, Nan Duan, Shuai Ma

Moreover, the controllability and explainability of LoopCAG are validated by analyzing spatial and temporal sensitivity during the generation process.

Contrastive Learning Image Captioning +1

Hierarchical Context-aware Network for Dense Video Event Captioning

1 code implementation ACL 2021 Lei Ji, Xianglin Guo, Haoyang Huang, Xilin Chen

Dense video event captioning aims to generate a sequence of descriptive captions for each event in a long untrimmed video.
GEM: A General Evaluation Benchmark for Multimodal Tasks

1 code implementation Findings (ACL) 2021 Lin Su, Nan Duan, Edward Cui, Lei Ji, Chenfei Wu, Huaishao Luo, Yongfei Liu, Ming Zhong, Taroon Bharti, Arun Sacheti

Comparing with existing multimodal datasets such as MSCOCO and Flickr30K for image-language tasks, YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language tasks and video-language tasks at the same time, but also labeled in multiple languages.

GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions

1 code implementation30 Apr 2021 Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, Nan Duan

Generating videos from text is a challenging task due to its high computational requirements for training and infinite possible answers for evaluation.

Ranked #16 on Text-to-Video Generation on MSR-VTT (CLIPSIM metric)

Text-to-Video Generation Video Generation

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

5 code implementations18 Apr 2021 Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li

In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.

Retrieval Text Retrieval +4
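The transfer idea in CLIP4Clip can be sketched at its simplest: embed sampled frames with CLIP's image encoder, pool them into one video vector, and score it against the CLIP text embedding by cosine similarity. The arrays below are stand-ins for real CLIP outputs, and mean pooling is only one of the similarity calculators the paper studies.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize vectors to unit length for cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def video_text_similarity(frame_embs, text_emb):
    """Mean-pool per-frame embeddings into a single video vector,
    then take cosine similarity with the text embedding."""
    video_emb = l2_normalize(frame_embs.mean(axis=0))  # (d,)
    return float(l2_normalize(text_emb) @ video_emb)

# Toy example: 8 frames, 4-dim embeddings (real CLIP uses 512).
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 4))
query = frames.mean(axis=0)  # a query perfectly aligned with the video
print(video_text_similarity(frames, query))  # close to 1.0
```

Ranking every video in a gallery by this score gives text-to-video retrieval with no video-specific training, which is the zero-shot baseline the end-to-end fine-tuning then improves on.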

XGPT: Cross-modal Generative Pre-Training for Image Captioning

no code implementations3 Mar 2020 Qiaolin Xia, Haoyang Huang, Nan Duan, Dong-dong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou

While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly.

Data Augmentation Denoising +7

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

2 code implementations15 Feb 2020 Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, Ming Zhou

However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks.

Ranked #2 on Action Segmentation on COIN (using extra training data)

Action Segmentation Decoder +3

Knowledge Aware Semantic Concept Expansion for Image-Text Matching

no code implementations International Joint Conference on Artificial Intelligence (IJCAI) 2019 Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, Nan Duan

In this paper, we develop a Scene Concept Graph (SCG) by aggregating image scene graphs and extracting frequently co-occurred concept pairs as scene common-sense knowledge.

Common Sense Reasoning Content-Based Image Retrieval +3

Dense Procedure Captioning in Narrated Instructional Videos

no code implementations ACL 2019 Botian Shi, Lei Ji, Yaobo Liang, Nan Duan, Peng Chen, Zhendong Niu, Ming Zhou

Understanding narrated instructional videos is important for both research and real-world web applications.

Dense Captioning
