1 code implementation • 6 Mar 2025 • Yifei HUANG, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Mingfang Zhang, Lijin Yang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Xinyuan Chen, Yaohui Wang, Yali Wang, Yu Qiao, LiMin Wang
We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices.
1 code implementation • 2 Mar 2025 • Baoqi Pei, Yifei HUANG, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, LiMin Wang
However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dynamics between hands and objects.
no code implementations • 16 Feb 2025 • Runtian Yuan, Jilan Xu, Mohan Chen, Qingqiu Li, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao
We develop a strong baseline model, Text-Promptable Propagation (TPP), designed to exploit the intrinsic relationships among sequential images and their associated textual descriptions.
1 code implementation • 30 Dec 2024 • Yifei HUANG, Jilan Xu, Baoqi Pei, Yuping He, Guo Chen, Lijin Yang, Xinyuan Chen, Yaohui Wang, Zheng Nie, Jinyao Liu, Guoshun Fan, Dechen Lin, Fang Fang, Kunpeng Li, Chang Yuan, Yali Wang, Yu Qiao, LiMin Wang
We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model.
no code implementations • 16 Dec 2024 • Guo Chen, Yicheng Liu, Yifei HUANG, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, LiMin Wang
However, because of the inherent limitation of MCQ-based evaluation and the increasing reasoning ability of MLLMs, models can give the current answer purely by combining short video understanding with elimination, without genuinely understanding the video content.
1 code implementation • 26 Jun 2024 • Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei HUANG, Yali Wang, Tong Lu, LiMin Wang, Yu Qiao
In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.
Ranked #1 on
Long Term Action Anticipation
on Ego4D
(using extra training data)
no code implementations • 9 Apr 2024 • Junlin Hou, Jilan Xu, Hao Chen
In the former branch, we train a convolutional neural network (CNN) with an inserted CAW layer to perform skin lesion diagnosis.
no code implementations • 8 Apr 2024 • Junlin Hou, Jilan Xu, Rui Feng, Hao Chen
Previous noise learning methods mainly considered noise arising from images being mislabeled, i. e. label noise, assuming that all mislabeled images are of high image quality.
1 code implementation • CVPR 2024 • Yifei HUANG, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, LiMin Wang, Yu Qiao
Along with the videos we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.
Ranked #1 on
Action Anticipation
on EgoExoLearn
(using extra training data)
2 code implementations • 22 Mar 2024 • Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei HUANG, Yu Qiao, Yali Wang, LiMin Wang
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue.
Ranked #1 on
Action Classification
on MIT
no code implementations • 18 Mar 2024 • Runtian Yuan, Qingqiu Li, Junlin Hou, Jilan Xu, Yuejie Zhang, Rui Feng, Hao Chen
In response to the need for rapid and accurate COVID-19 diagnosis during the global pandemic, we present a two-stage framework that leverages pseudo labels for domain adaptation to enhance the detection of COVID-19 from CT scans.
no code implementations • 18 Mar 2024 • Qingqiu Li, Runtian Yuan, Junlin Hou, Jilan Xu, Yuejie Zhang, Rui Feng, Hao Chen
To make a more accurate diagnosis of COVID-19, we propose a straightforward yet effective model.
1 code implementation • 14 Mar 2024 • Guo Chen, Yifei HUANG, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, LiMin Wang
We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.
Ranked #2 on
Temporal Action Localization
on FineAction
no code implementations • 14 Mar 2024 • Qingqiu Li, Xiaohan Yan, Jilan Xu, Runtian Yuan, Yuejie Zhang, Rui Feng, Quanli Shen, Xiaobo Zhang, Shujun Wang
For finding and existence, we regard them as image tags, applying an image-tag recognition decoder to associate image features with their respective tags within each sample and constructing soft labels for contrastive learning to improve the semantic association of different image-report pairs.
no code implementations • CVPR 2024 • Jilan Xu, Yifei HUANG, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos.
3 code implementations • CVPR 2024 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, LiMin Wang, Yu Qiao
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
no code implementations • 1 Nov 2023 • Qingqiu Li, Jilan Xu, Runtian Yuan, Mohan Chen, Yuejie Zhang, Rui Feng, Xiaobo Zhang, Shang Gao
Automatic generation of radiology reports holds crucial clinical value, as it can alleviate substantial workload on radiologists and remind less experienced ones of potential anomalies.
1 code implementation • 22 May 2023 • Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei HUANG, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, LiMin Wang
Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.
no code implementations • 1 Apr 2023 • Fenggang Liu, Yangguang Li, Feng Liang, Jilan Xu, Bin Huang, Jing Shao
We mask part of patches in the representation space and then utilize sparse visible patches to reconstruct high semantic image representation.
1 code implementation • CVPR 2023 • Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, Weidi Xie
The former aims to infer all masked entities in the caption given the group tokens, that enables the model to learn fine-grained alignment between visual groups and text entities.
Open Vocabulary Semantic Segmentation
Open-Vocabulary Semantic Segmentation
+1
2 code implementations • 6 Dec 2022 • Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
Ranked #1 on
Action Recognition
on Something-Something V1
(using extra training data)
1 code implementation • 26 Nov 2022 • Junlin Hou, Jilan Xu, Fan Xiao, Rui-Wei Zhao, Yuejie Zhang, Haidong Zou, Lina Lu, Wenwen Xue, Rui Feng
However, automatic DR grading based on two-field fundus photography remains a challenging task due to the lack of publicly available datasets and effective fusion strategies.
no code implementations • 26 Nov 2022 • Junlin Hou, Jilan Xu, Nan Zhang, Yuejie Zhang, Xiaobo Zhang, Rui Feng
In our approach, we devise a novel infection-aware 3D Contrastive Mixup Classification network for severity grading.
no code implementations • 26 Nov 2022 • Junlin Hou, Jilan Xu, Nan Zhang, Yi Wang, Yuejie Zhang, Xiaobo Zhang, Rui Feng
This paper presents our solution for the 2nd COVID-19 Competition, occurring in the framework of the AIMIA Workshop at the European Conference on Computer Vision (ECCV 2022).
1 code implementation • 2 Oct 2022 • Junlin Hou, Fan Xiao, Jilan Xu, Yuejie Zhang, Haidong Zou, Rui Feng
In the image quality assessment task, we create an ensemble of InceptionV3, SE-ResNeXt, and Vision Transformer models.
no code implementations • 5 Jul 2022 • Junlin Hou, Jilan Xu, Rui Feng, Yuejie Zhang
This paper presents our solution for the 2nd COVID-19 Competition, occurring in the framework of the AIMIA Workshop in the European Conference on Computer Vision (ECCV 2022).
1 code implementation • CVPR 2022 • Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Rui-Wei Zhao, Tao Zhang, Xuequan Lu, Shang Gao
In this paper, we empirically prove that this problem is associated with the mixup of the activation values between less discriminative foreground regions and the background.
no code implementations • 2 Dec 2018 • Jiawei Zhang, Yuzhen Jin, Jilan Xu, Xiaowei Xu, Yanchun Zhang
The three multi-scale dense connections improve U-net performance by up to 1. 8% on test A and 3. 5% on test B in the MICCAI Gland dataset.