Search Results for author: Jilan Xu

Found 28 papers, 14 papers with code

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

1 code implementation · 2 Mar 2025 · Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang

However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dynamics between hands and objects.

Large Language Model · Multi-Instance Retrieval · +4

Text-promptable Propagation for Referring Medical Image Sequence Segmentation

no code implementations · 16 Feb 2025 · Runtian Yuan, Jilan Xu, Mohan Chen, Qingqiu Li, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao

We develop a strong baseline model, Text-Promptable Propagation (TPP), designed to exploit the intrinsic relationships among sequential images and their associated textual descriptions.

Interactive Segmentation · Segmentation

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

no code implementations · 16 Dec 2024 · Guo Chen, Yicheng Liu, Yifei Huang, Yuping He, Baoqi Pei, Jilan Xu, Yali Wang, Tong Lu, Limin Wang

However, because of the inherent limitations of MCQ-based evaluation and the increasing reasoning ability of MLLMs, models can arrive at the correct answer purely by combining short-video understanding with elimination, without genuinely understanding the video content.

Hallucination · Multiple-choice · +2

EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation

1 code implementation · 26 Jun 2024 · Baoqi Pei, Guo Chen, Jilan Xu, Yuping He, Yicheng Liu, Kanghua Pan, Yifei Huang, Yali Wang, Tong Lu, Limin Wang, Yu Qiao

In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge.

 Ranked #1 on Long Term Action Anticipation on Ego4D (using extra training data)

Action Anticipation · Action Recognition · +6

Concept-Attention Whitening for Interpretable Skin Lesion Diagnosis

no code implementations · 9 Apr 2024 · Junlin Hou, Jilan Xu, Hao Chen

In the former branch, we train a convolutional neural network (CNN) with an inserted CAW layer to perform skin lesion diagnosis.

Concept Alignment · Diagnostic · +2

QMix: Quality-aware Learning with Mixed Noise for Robust Retinal Disease Diagnosis

no code implementations · 8 Apr 2024 · Junlin Hou, Jilan Xu, Rui Feng, Hao Chen

Previous noise learning methods mainly considered noise arising from images being mislabeled, i.e., label noise, assuming that all mislabeled images are of high image quality.

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

1 code implementation · CVPR 2024 · Yifei Huang, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, Limin Wang, Yu Qiao

Along with the videos we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.

 Ranked #1 on Action Anticipation on EgoExoLearn (using extra training data)

Action Anticipation · Action Quality Assessment · +2

InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

2 code implementations · 22 Mar 2024 · Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieves state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue.

Action Classification · Action Recognition · +13

Domain Adaptation Using Pseudo Labels for COVID-19 Detection

no code implementations · 18 Mar 2024 · Runtian Yuan, Qingqiu Li, Junlin Hou, Jilan Xu, Yuejie Zhang, Rui Feng, Hao Chen

In response to the need for rapid and accurate COVID-19 diagnosis during the global pandemic, we present a two-stage framework that leverages pseudo labels for domain adaptation to enhance the detection of COVID-19 from CT scans.

COVID-19 Diagnosis · Diagnostic · +2

Advancing COVID-19 Detection in 3D CT Scans

no code implementations · 18 Mar 2024 · Qingqiu Li, Runtian Yuan, Junlin Hou, Jilan Xu, Yuejie Zhang, Rui Feng, Hao Chen

To make a more accurate diagnosis of COVID-19, we propose a straightforward yet effective model.

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

1 code implementation · 14 Mar 2024 · Guo Chen, Yifei Huang, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, Limin Wang

We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.

Mamba · Moment Retrieval · +2

Anatomical Structure-Guided Medical Vision-Language Pre-training

no code implementations · 14 Mar 2024 · Qingqiu Li, Xiaohan Yan, Jilan Xu, Runtian Yuan, Yuejie Zhang, Rui Feng, Quanli Shen, Xiaobo Zhang, Shujun Wang

For finding and existence, we regard them as image tags and apply an image-tag recognition decoder to associate image features with their respective tags within each sample, constructing soft labels for contrastive learning to improve the semantic association of different image-report pairs.

Contrastive Learning · Decoder · +3

Retrieval-Augmented Egocentric Video Captioning

no code implementations · CVPR 2024 · Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie

In this paper, we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos.

Representation Learning · Retrieval · +1

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

3 code implementations · CVPR 2024 · Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, Yu Qiao

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.

Diagnostic · Fairness · +11

Enhanced Knowledge Injection for Radiology Report Generation

no code implementations · 1 Nov 2023 · Qingqiu Li, Jilan Xu, Runtian Yuan, Mohan Chen, Yuejie Zhang, Rui Feng, Xiaobo Zhang, Shang Gao

Automatic generation of radiology reports holds crucial clinical value, as it can alleviate substantial workload on radiologists and remind less experienced ones of potential anomalies.

Image Captioning · Retrieval

VideoLLM: Modeling Video Sequence with Large Language Models

1 code implementation · 22 May 2023 · Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei Huang, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, Limin Wang

Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.

Decoder · Video Understanding

Mask Hierarchical Features For Self-Supervised Learning

no code implementations · 1 Apr 2023 · Fenggang Liu, Yangguang Li, Feng Liang, Jilan Xu, Bin Huang, Jing Shao

We mask part of patches in the representation space and then utilize sparse visible patches to reconstruct high semantic image representation.

object-detection · Object Detection · +1

Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision

1 code implementation · CVPR 2023 · Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, Weidi Xie

The former aims to infer all masked entities in the caption given the group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities.

Open Vocabulary Semantic Segmentation · Open-Vocabulary Semantic Segmentation · +1

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

2 code implementations · 6 Dec 2022 · Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao

Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.

 Ranked #1 on Action Recognition on Something-Something V1 (using extra training data)

Action Classification · Contrastive Learning · +8

Cross-Field Transformer for Diabetic Retinopathy Grading on Two-field Fundus Images

1 code implementation · 26 Nov 2022 · Junlin Hou, Jilan Xu, Fan Xiao, Rui-Wei Zhao, Yuejie Zhang, Haidong Zou, Lina Lu, Wenwen Xue, Rui Feng

However, automatic DR grading based on two-field fundus photography remains a challenging task due to the lack of publicly available datasets and effective fusion strategies.

Diabetic Retinopathy Grading · Position

CMC v2: Towards More Accurate COVID-19 Detection with Discriminative Video Priors

no code implementations · 26 Nov 2022 · Junlin Hou, Jilan Xu, Nan Zhang, Yi Wang, Yuejie Zhang, Xiaobo Zhang, Rui Feng

This paper presents our solution for the 2nd COVID-19 Competition, occurring in the framework of the AIMIA Workshop at the European Conference on Computer Vision (ECCV 2022).

COVID-19 Diagnosis · Representation Learning

FDVTS's Solution for 2nd COV19D Competition on COVID-19 Detection and Severity Analysis

no code implementations · 5 Jul 2022 · Junlin Hou, Jilan Xu, Rui Feng, Yuejie Zhang

This paper presents our solution for the 2nd COVID-19 Competition, occurring in the framework of the AIMIA Workshop at the European Conference on Computer Vision (ECCV 2022).

Classification · COVID-19 Diagnosis · +1

CREAM: Weakly Supervised Object Localization via Class RE-Activation Mapping

1 code implementation · CVPR 2022 · Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Rui-Wei Zhao, Tao Zhang, Xuequan Lu, Shang Gao

In this paper, we empirically prove that this problem is associated with the mixup of the activation values between less discriminative foreground regions and the background.

Clustering · Object · +2

MDU-Net: Multi-scale Densely Connected U-Net for biomedical image segmentation

no code implementations · 2 Dec 2018 · Jiawei Zhang, Yuzhen Jin, Jilan Xu, Xiaowei Xu, Yanchun Zhang

The three multi-scale dense connections improve U-Net performance by up to 1.8% on test A and 3.5% on test B of the MICCAI Gland dataset.

Decoder · Image Segmentation · +3
