no code implementations • 29 Mar 2023 • Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, Yun Wang, Linjun Shou, Ming Gong, Nan Duan
On the other hand, many existing models and systems (symbolic or neural) can already perform certain domain-specific tasks very well.
1 code implementation • 19 Dec 2022 • Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou
To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, the ability to seek answers from long-form videos with diverse and complex events is essential.
no code implementations • 16 Nov 2022 • Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan
This technical report describes the CONE approach for Ego4D Natural Language Queries (NLQ) Challenge in ECCV 2022.
no code implementations • 10 Oct 2022 • Kun Yan, Lei Ji, Chenfei Wu, Jian Liang, Ming Zhou, Nan Duan, Shuai Ma
Panorama synthesis aims to generate a visual scene with all 360-degree views and enables an immersive virtual world.
no code implementations • 22 Sep 2022 • Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan
Analysis reveals the effectiveness of each component and higher efficiency in long-video grounding: our system improves inference speed by 2x on Ego4D-NLQ and by 15x on MAD while maintaining CONE's SOTA performance.
no code implementations • 2 Dec 2021 • Huaishao Luo, Lei Ji, Yanyong Huang, Bin Wang, Shenggong Ji, Tianrui Li
This paper proposes a fusion model named ScaleVLAD to gather multi-Scale representation from text, video, and audio with shared Vectors of Locally Aggregated Descriptors to improve unaligned multimodal sentiment analysis.
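The shared-descriptor aggregation behind ScaleVLAD can be illustrated with a NetVLAD-style soft assignment: features from each modality are softly assigned to a set of cluster centers shared across modalities, and the weighted residuals are aggregated into a fixed-size descriptor. This is a minimal numpy sketch of the general technique, not the paper's implementation; all shapes and the `temperature` parameter are illustrative.

```python
import numpy as np

def vlad_aggregate(features, centers, temperature=1.0):
    """Softly assign features (T, D) to shared cluster centers (K, D)
    and aggregate the weighted residuals into a (K*D,) descriptor."""
    sims = features @ centers.T / temperature            # (T, K) similarities
    assign = np.exp(sims - sims.max(axis=1, keepdims=True))
    assign /= assign.sum(axis=1, keepdims=True)          # soft assignment weights
    resid = features[:, None, :] - centers[None, :, :]   # (T, K, D) residuals
    vlad = (assign[:, :, None] * resid).sum(axis=0)      # (K, D) aggregated
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-8
    return vlad.reshape(-1)                              # flatten to (K*D,)

rng = np.random.default_rng(0)
centers = rng.normal(size=(8, 64))                       # shared across modalities
text  = vlad_aggregate(rng.normal(size=(20, 64)), centers)
video = vlad_aggregate(rng.normal(size=(50, 64)), centers)
fused = np.concatenate([text, video])                    # simple descriptor fusion
```

Because all modalities are aggregated against the same centers, sequences of different lengths (here 20 text tokens vs. 50 video frames) end up in a common, comparable descriptor space.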
no code implementations • NeurIPS 2021 • Weijiang Yu, Haoteng Zheng, Mengfei Li, Lei Ji, Lijun Wu, Nong Xiao, Nan Duan
To incorporate the interdependent knowledge between contextual clips into network inference, we propose a Siamese Sampling and Reasoning (SiaSamRea) approach, which consists of a siamese sampling mechanism that generates sparse and similar clips (i.e., siamese clips) from the same video, and a novel reasoning strategy that integrates the interdependent knowledge between contextual clips into the network.
1 code implementation • 24 Nov 2021 • Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, Nan Duan
To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively.
Ranked #1 on Text-to-Video Generation on Kinetics
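The 1D/2D/3D unification described above can be sketched as flattening each modality's positional axes into a single token sequence over a shared embedding dimension, so one encoder-decoder can consume text, images, and videos alike. This is an illustrative sketch of the general idea, not the paper's architecture; all shapes are made up.

```python
import numpy as np

def to_tokens(x):
    """Flatten text (L, D), image (H, W, D), or video (F, H, W, D)
    into a unified (num_tokens, D) sequence for a shared transformer;
    all positional axes are collapsed together."""
    return x.reshape(-1, x.shape[-1])

D = 32
text  = np.zeros((16, D))        # 1D data: token sequence
image = np.zeros((8, 8, D))      # 2D data: patch grid
video = np.zeros((4, 8, 8, D))   # 3D data: frames x patch grid

tokens = [to_tokens(m) for m in (text, image, video)]
# each modality is now a (num_tokens, D) sequence: 16, 64, and 256 tokens
```

Once every modality is a token sequence of the same width, a single attention stack can process them; what differs per modality is only how many tokens the flattening produces.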
1 code implementation • 5 Aug 2021 • Weijiang Yu, Jian Liang, Lei Ji, Lu Li, Yuejian Fang, Nong Xiao, Nan Duan
Firstly, we develop multi-commonsense learning for semantic-level reasoning by jointly training different commonsense types in a unified network, which encourages the interaction between the clues of multiple commonsense descriptions, event-wise captions and videos.
1 code implementation • ACL 2021 • Lei Ji, Xianglin Guo, Haoyang Huang, Xilin Chen
Dense video event captioning aims to generate a sequence of descriptive captions for each event in a long untrimmed video.
no code implementations • ACL 2021 • Kun Yan, Lei Ji, Huaishao Luo, Ming Zhou, Nan Duan, Shuai Ma
Moreover, the controllability and explainability of LoopCAG are validated by analyzing spatial and temporal sensitivity during the generation process.
Ranked #1 on Image Captioning on Localized Narratives
1 code implementation • Findings (ACL) 2021 • Lin Su, Nan Duan, Edward Cui, Lei Ji, Chenfei Wu, Huaishao Luo, Yongfei Liu, Ming Zhong, Taroon Bharti, Arun Sacheti
Compared with existing multimodal datasets such as MSCOCO and Flickr30K for image-language tasks, and YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering both image-language and video-language tasks, but is also labeled in multiple languages.
1 code implementation • 30 Apr 2021 • Chenfei Wu, Lun Huang, Qianxi Zhang, Binyang Li, Lei Ji, Fan Yang, Guillermo Sapiro, Nan Duan
Generating videos from text is a challenging task due to its high computational requirements for training and infinite possible answers for evaluation.
Ranked #4 on Text-to-Video Generation on MSR-VTT
4 code implementations • 18 Apr 2021 • Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, Tianrui Li
In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.
Ranked #1 on Text to Video Retrieval on MSR-VTT
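One simple way to turn CLIP's per-frame image embeddings into a video-level retrieval score is to mean-pool the frame embeddings and compare the result to the text embedding by cosine similarity (a parameter-free aggregation strategy). The sketch below uses random vectors as stand-ins for real CLIP outputs; it illustrates the aggregation-and-scoring step, not the full end-to-end model.

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def video_text_similarity(frame_embs, text_emb):
    """Mean-pool per-frame embeddings (F, D) into one video embedding,
    then score it against the text embedding (D,) by cosine similarity."""
    video_emb = normalize(frame_embs).mean(axis=0)
    return float(normalize(video_emb) @ normalize(text_emb))

rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 512))   # stand-ins for CLIP image-encoder outputs
query  = rng.normal(size=(512,))      # stand-in for CLIP text-encoder output
score = video_text_similarity(frames, query)
```

For retrieval, the same score is computed against every candidate video (or every candidate caption) and the candidates are ranked by it.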
1 code implementation • Findings of the Association for Computational Linguistics 2020 • Huaishao Luo, Lei Ji, Tianrui Li, Nan Duan, Daxin Jiang
Specifically, a cascaded labeling module is developed to enhance the interchange between aspect terms and improve the attention of sentiment tokens when labeling sentiment polarities.
no code implementations • 16 Sep 2020 • Martin Kuo, Yaobo Liang, Lei Ji, Nan Duan, Linjun Shou, Ming Gong, Peng Chen
The semi-structured answer has two advantages over a span answer: it is more readable and more falsifiable.
1 code implementation • EMNLP (nlpbt) 2020 • Frank F. Xu, Lei Ji, Botian Shi, Junyi Du, Graham Neubig, Yonatan Bisk, Nan Duan
Instructional videos are often used to learn about procedures.
no code implementations • 3 Mar 2020 • Qiaolin Xia, Haoyang Huang, Nan Duan, Dong-dong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou
While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly.
2 code implementations • 15 Feb 2020 • Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Jason Li, Taroon Bharti, Ming Zhou
However, most of the existing multimodal models are pre-trained for understanding tasks, leading to a pretrain-finetune discrepancy for generation tasks.
Ranked #1 on Action Segmentation on COIN (using extra training data)
no code implementations • International Joint Conference on Artificial Intelligence (IJCAI) 2019 • Botian Shi, Lei Ji, Pan Lu, Zhendong Niu, Nan Duan
In this paper, we develop a Scene Concept Graph (SCG) by aggregating image scene graphs and extracting frequently co-occurred concept pairs as scene common-sense knowledge.
no code implementations • ACL 2019 • Botian Shi, Lei Ji, Yaobo Liang, Nan Duan, Peng Chen, Zhendong Niu, Ming Zhou
Understanding narrated instructional videos is important for both research and real-world web applications.
1 code implementation • 24 May 2018 • Pan Lu, Lei Ji, Wei zhang, Nan Duan, Ming Zhou, Jianyong Wang
To better utilize semantic knowledge in images, we propose a novel framework to learn visual relation facts for VQA.
Ranked #3 on Visual Question Answering (VQA) on COCO Visual Question Answering (VQA) real images 1.0 multiple choice