1 code implementation • ECCV 2020 • Mingfei Han, Yali Wang, Xiaojun Chang, Yu Qiao
Recent studies have shown that aggregating contextual information from proposals in different frames can clearly enhance the performance of video object detection.
Ranked #11 on Video Object Detection on ImageNet VID
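A minimal sketch of the general idea behind cross-frame proposal aggregation, not the paper's implementation: proposal features from the current frame cross-attend to proposals pooled from neighbouring support frames. All module names, dimensions, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProposalContextAggregator(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Cross-attention: target-frame proposals attend to proposals
        # gathered from neighbouring support frames.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_props, support_props):
        # target_props:  (B, N, dim) proposal features of the current frame
        # support_props: (B, M, dim) proposal features from other frames
        ctx, _ = self.attn(target_props, support_props, support_props)
        return self.norm(target_props + ctx)  # residual enhancement

agg = ProposalContextAggregator()
tgt = torch.randn(2, 100, 256)   # 100 proposals in the current frame
sup = torch.randn(2, 300, 256)   # proposals pooled from 3 support frames
enhanced = agg(tgt, sup)         # (2, 100, 256)
```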
1 code implementation • 4 Apr 2024 • Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang
In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos.
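A hedged sketch of the local-plus-global encoding idea described above: per-segment (local) tokens and a video-level (global) token are concatenated and projected into the LLM's embedding space. The function name, segment count, and projector are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

def encode_video_tokens(frame_feats, num_segments=8, llm_dim=4096):
    # frame_feats: (T, D) per-frame features from a frozen visual backbone
    segs = frame_feats.chunk(num_segments, dim=0)
    local_tokens = torch.stack([s.mean(0) for s in segs])    # (num_segments, D)
    global_token = frame_feats.mean(0, keepdim=True)         # (1, D)
    tokens = torch.cat([local_tokens, global_token], dim=0)  # (num_segments+1, D)
    proj = nn.Linear(frame_feats.shape[1], llm_dim)          # hypothetical projector
    return proj(tokens)                                      # visual tokens for the LLM

feats = torch.randn(64, 768)       # 64 frames of CLIP-like features (assumed)
visual_tokens = encode_video_tokens(feats)
print(visual_tokens.shape)         # torch.Size([9, 4096])
```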
1 code implementation • 21 Dec 2023 • Mingfei Han, Linjie Yang, Xiaojie Jin, Jiashi Feng, Xiaojun Chang, Heng Wang
While existing datasets mainly comprise landscape-mode videos, our paper seeks to introduce portrait-mode videos to the research community and highlight the unique challenges associated with this video format.
1 code implementation • 16 Dec 2023 • Mingfei Han, Linjie Yang, Xiaojun Chang, Heng Wang
A human needs to capture both the event in every shot and associate the shots together to understand the story behind the video.
Ranked #1 on video narration captioning on Shot2Story20K
no code implementations • 4 Dec 2023 • Chengyou Jia, Minnan Luo, Xiaojun Chang, Zhuohang Dang, Mingfei Han, Mengmeng Wang, Guang Dai, Sizhe Dang, Jingdong Wang
To realize this, we innovatively blend video models with Large Language Models (LLMs) to devise Action-conditioned Prompts.
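An illustrative sketch of the action-conditioned prompt idea: prompts built around action phrases (hard-coded below; in the paper they are devised with an LLM) are scored against a video embedding CLIP-style. The text encoder and all names here are stand-in assumptions, not the authors' pipeline.

```python
import torch
import torch.nn.functional as F

class DummyTextEncoder:
    """Stand-in for a CLIP-style text encoder (an assumption, not the paper's model)."""
    def __call__(self, prompts):
        g = torch.Generator().manual_seed(0)
        return torch.randn(len(prompts), 512, generator=g)

def classify_with_action_prompts(video_feat, text_encoder, actions):
    # actions: LLM-generated action descriptions (hard-coded here for illustration)
    prompts = [f"a video of {a}" for a in actions]
    text_feats = text_encoder(prompts)                        # (num_actions, D)
    sims = F.cosine_similarity(video_feat[None], text_feats)  # (num_actions,)
    return sims.softmax(dim=-1)                               # class probabilities

probs = classify_with_action_prompts(
    torch.randn(512), DummyTextEncoder(),
    ["a person swinging a tennis racket", "a person riding a bicycle"],
)
print(probs)   # probabilities over the two hypothetical actions
```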
1 code implementation • NeurIPS 2023 • Yuetian Weng, Mingfei Han, Haoyu He, Mingjie Li, Lina Yao, Xiaojun Chang, Bohan Zhuang
By reusing predictions from key frames, we circumvent the need to process a large volume of video frames individually with resource-intensive segmentors, alleviating temporal redundancy and significantly reducing computational costs.
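A minimal sketch of the key-frame reuse pattern: run the expensive segmentor only on key frames and carry its prediction to the frames in between. Real systems typically refine the reused prediction (e.g. with motion cues); this skeleton, with illustrative names throughout, only shows the compute-saving control flow.

```python
import torch

def segment_video(frames, segmentor, key_stride=5):
    # frames: list of (3, H, W) tensors; segmentor: heavy per-frame model
    preds, last_pred = [], None
    for i, frame in enumerate(frames):
        if i % key_stride == 0:                 # key frame: full inference
            last_pred = segmentor(frame[None])[0]
        preds.append(last_pred)                 # non-key frames reuse the prediction
    return preds

# Toy usage with a hypothetical 'segmentor' producing 21-class logits:
segmentor = lambda x: torch.zeros(x.shape[0], 21, *x.shape[-2:])
frames = [torch.randn(3, 64, 64) for _ in range(12)]
masks = segment_video(frames, segmentor)
print(len(masks), masks[0].shape)   # 12 torch.Size([21, 64, 64])
```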
no code implementations • ICCV 2023 • Mingfei Han, Yali Wang, Zhihui Li, Lina Yao, Xiaojun Chang, Yu Qiao
To tackle this problem, we propose a concise Hybrid Temporal-scale Multimodal Learning (HTML) framework that effectively aligns linguistic and visual features to discover core object semantics in the video, learning multimodal interaction hierarchically across different temporal scales.
Ranked #6 on Referring Video Object Segmentation on Refer-YouTube-VOS (using extra training data)
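A hedged sketch of multi-temporal-scale language-video interaction, loosely mirroring the hierarchical alignment described above: the text query cross-attends to video features pooled at several temporal scales. All module names, scales, and sizes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MultiScaleInteraction(nn.Module):
    def __init__(self, dim=256, scales=(1, 2, 4), heads=8):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in scales
        )

    def forward(self, text, video):
        # text: (B, L, dim) word features; video: (B, T, dim) frame features
        out = text
        for s, attn in zip(self.scales, self.attn):
            # Temporal average pooling with stride s gives a coarser video scale.
            v = nn.functional.avg_pool1d(video.transpose(1, 2), s, s).transpose(1, 2)
            ctx, _ = attn(out, v, v)
            out = out + ctx                 # accumulate context from each scale
        return out

m = MultiScaleInteraction()
fused = m(torch.randn(2, 10, 256), torch.randn(2, 32, 256))  # (2, 10, 256)
```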
no code implementations • 21 Jul 2022 • Yuetian Weng, Zizheng Pan, Mingfei Han, Xiaojun Chang, Bohan Zhuang
The task of action detection aims to infer both the action category and the start and end moments of each action instance in a long, untrimmed video.
2 code implementations • 16 Jul 2022 • Mingjie Li, Rui Liu, Guangsi Shi, Mingfei Han, Changling Li, Lina Yao, Xiaojun Chang, Ling Chen
To further enhance forecasting accuracy, we introduce a memory-driven decoder.
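A toy sketch of what a memory-driven decoder can look like: at each forecast step, the decoder state soft-reads a learned memory bank of prototype patterns before emitting the next prediction. This illustrates the general mechanism only; the architecture and parameters are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class MemoryDecoder(nn.Module):
    def __init__(self, dim=64, memory_slots=32, horizon=12):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(memory_slots, dim))  # learned bank
        self.gru = nn.GRUCell(dim, dim)
        self.out = nn.Linear(dim, 1)
        self.horizon = horizon

    def forward(self, h):
        # h: (B, dim) encoder summary of the observed sequence
        preds = []
        for _ in range(self.horizon):
            # Soft-read the memory: similarity-weighted sum of the slots.
            w = torch.softmax(h @ self.memory.t(), dim=-1)  # (B, slots)
            read = w @ self.memory                          # (B, dim)
            h = self.gru(read, h)
            preds.append(self.out(h))
        return torch.cat(preds, dim=-1)                     # (B, horizon)

dec = MemoryDecoder()
forecast = dec(torch.randn(4, 64))   # (4, 12) future steps
```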
no code implementations • CVPR 2022 • Mingfei Han, David Junhao Zhang, Yali Wang, Rui Yan, Lina Yao, Xiaojun Chang, Yu Qiao
Learning spatial-temporal relation among multiple actors is crucial for group activity recognition.