Search Results for author: Xinpeng Ding

Found 16 papers, 10 papers with code

Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models

1 code implementation • 2 Jan 2024 • Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, Xiaomeng Li

BEV-InMLLM integrates multi-view, spatial awareness, and temporal semantics to enhance MLLMs' capabilities on NuInstruct tasks.

Autonomous Driving
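The entry above only states that BEV information is injected into an MLLM; the paper's actual injection module is not reproduced here. Below is a minimal, hypothetical PyTorch sketch of one common way such an injection can be wired up: language-side tokens cross-attend to flattened BEV feature tokens before entering the language model. The class name `BEVInjection` and all dimensions are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class BEVInjection(nn.Module):
    """Illustrative sketch: fuse BEV feature tokens into MLLM token embeddings
    via cross-attention, so the language model can attend to spatial layout."""

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, llm_tokens: torch.Tensor, bev_tokens: torch.Tensor) -> torch.Tensor:
        # llm_tokens: (B, T, D) text/visual tokens entering the language model
        # bev_tokens: (B, N, D) flattened BEV grid features from multi-view images
        fused, _ = self.cross_attn(query=llm_tokens, key=bev_tokens, value=bev_tokens)
        return self.norm(llm_tokens + fused)  # residual injection

# Toy usage with random tensors
inject = BEVInjection()
tokens = inject(torch.randn(2, 32, 768), torch.randn(2, 200, 768))
print(tokens.shape)  # torch.Size([2, 32, 768])
```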

EtC: Temporal Boundary Expand then Clarify for Weakly Supervised Video Grounding with Multimodal Large Language Model

no code implementations • 5 Dec 2023 • Guozhang Li, Xinpeng Ding, De Cheng, Jie Li, Nannan Wang, Xinbo Gao

To further clarify the noise of the expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective, using a learnable approach to balance the incomplete yet clean (initial) boundaries against the comprehensive yet noisy (expanded) ones and obtain more precise boundaries.

Boundary Detection · Language Modelling +2
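The snippet above mentions a proposal-level contrastive objective between initial and expanded boundary proposals. As a rough illustration only (the paper's exact formulation may differ), an InfoNCE-style loss that pulls each initial (clean) proposal toward its own expanded (noisy but comprehensive) counterpart could look like this:

```python
import torch
import torch.nn.functional as F

def proposal_contrastive_loss(init_feats: torch.Tensor,
                              expand_feats: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Illustrative InfoNCE-style loss: each initial proposal is pulled toward
    its paired expanded proposal and pushed away from the other proposals."""
    init_feats = F.normalize(init_feats, dim=-1)      # (N, D)
    expand_feats = F.normalize(expand_feats, dim=-1)  # (N, D)
    logits = init_feats @ expand_feats.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(init_feats.size(0))             # positives on the diagonal
    return F.cross_entropy(logits, targets)

loss = proposal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```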

GraphEcho: Graph-Driven Unsupervised Domain Adaptation for Echocardiogram Video Segmentation

1 code implementation • ICCV 2023 • Jiewen Yang, Xinpeng Ding, Ziyang Zheng, Xiaowei Xu, Xiaomeng Li

This paper studies unsupervised domain adaptation (UDA) for echocardiogram video segmentation, where the goal is to generalize a model trained on the source domain to other unlabelled target domains.

Graph Matching · Segmentation +3

GL-Fusion: Global-Local Fusion Network for Multi-view Echocardiogram Video Segmentation

1 code implementation • 20 Sep 2023 • Ziyang Zheng, Jiewen Yang, Xinpeng Ding, Xiaowei Xu, Xiaomeng Li

Additionally, a Multi-view Local-based Fusion Module (MLFM) is designed to extract correlations of cardiac structures from different views.

Video Segmentation · Video Semantic Segmentation
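The MLFM described above extracts correlations of cardiac structures across views; the module below is a simplified stand-in, not the paper's implementation, showing how features from one echo view can attend to features from another. The class name `CrossViewFusion` and the dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Illustrative multi-view fusion: features from view A query features from
    view B so correlated cardiac structures reinforce each other."""

    def __init__(self, channels: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, n_heads, batch_first=True)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (B, C, H, W) feature maps from two echocardiogram views
        b, c, h, w = feat_a.shape
        qa = feat_a.flatten(2).transpose(1, 2)  # (B, H*W, C)
        kb = feat_b.flatten(2).transpose(1, 2)
        fused, _ = self.attn(qa, kb, kb)        # view A attends to view B
        return (qa + fused).transpose(1, 2).reshape(b, c, h, w)

out = CrossViewFusion()(torch.randn(1, 128, 16, 16), torch.randn(1, 128, 16, 16))
print(out.shape)  # torch.Size([1, 128, 16, 16])
```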

HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving

no code implementations • 11 Sep 2023 • Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li

For the first time, we leverage a single multimodal large language model (MLLM) to consolidate multiple autonomous driving tasks from videos, i.e., the Risk Object Localization and Intention and Suggestion Prediction (ROLISP) task.

Autonomous Driving · Object Localization

Uniformly Distributed Category Prototype-Guided Vision-Language Framework for Long-Tail Recognition

no code implementations • 24 Aug 2023 • Siming Fu, Xiaoxuan He, Xinpeng Ding, Yuchen Cao, Hualiang Wang

The category prototype-guided mechanism for image-text matching drives the features of different classes to converge to distinct category prototypes that are kept uniformly distributed in the feature space, which improves class boundaries.

Attribute · Image-text matching +1
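Two pieces are implied by the snippet above: prototypes spread uniformly on the feature sphere, and features pulled toward their class prototype. The sketch below illustrates one simple way to obtain both, under the assumption that "uniformly distributed" means maximally separated directions on the unit hypersphere; it is not the paper's algorithm.

```python
import torch
import torch.nn.functional as F

def make_uniform_prototypes(num_classes: int, dim: int, steps: int = 200) -> torch.Tensor:
    """Illustrative: start from random directions and minimise each prototype's
    similarity to its nearest neighbour, spreading them over the hypersphere."""
    protos = torch.randn(num_classes, dim, requires_grad=True)
    opt = torch.optim.SGD([protos], lr=0.1)
    for _ in range(steps):
        p = F.normalize(protos, dim=-1)
        sim = p @ p.t() - 2 * torch.eye(num_classes)  # mask out self-similarity
        loss = sim.max(dim=1).values.mean()           # push nearest neighbours apart
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(protos.detach(), dim=-1)

def prototype_alignment_loss(features: torch.Tensor, labels: torch.Tensor,
                             prototypes: torch.Tensor) -> torch.Tensor:
    """Pull each sample's feature toward its class prototype (cosine distance)."""
    feats = F.normalize(features, dim=-1)
    return (1 - (feats * prototypes[labels]).sum(dim=-1)).mean()

protos = make_uniform_prototypes(num_classes=10, dim=64)
loss = prototype_alignment_loss(torch.randn(32, 64), torch.randint(0, 10, (32,)), protos)
print(loss.item())
```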

Boosting Weakly-Supervised Temporal Action Localization with Text Information

1 code implementation • CVPR 2023 • Guozhang Li, De Cheng, Xinpeng Ding, Nannan Wang, Xiaoyu Wang, Xinbo Gao

For the discriminative objective, we propose a Text-Segment Mining (TSM) mechanism, which constructs a text description based on the action class label, and regards the text as the query to mine all class-related segments.

Sentence · Weakly-supervised Temporal Action Localization +1
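The TSM mechanism above uses a class-label text description as a query to mine class-related segments. A minimal sketch of that idea, assuming the query and segment embeddings come from a paired text/video encoder (the embeddings below are random placeholders), is:

```python
import torch
import torch.nn.functional as F

def mine_class_segments(text_query: torch.Tensor,
                        segment_feats: torch.Tensor,
                        top_k: int = 5):
    """Illustrative text-to-segment mining: score every video segment by cosine
    similarity to a text query built from the action class label (e.g. a
    sentence like "a video of high jump"), then keep the top-k segments."""
    q = F.normalize(text_query, dim=-1)      # (D,) text query embedding
    s = F.normalize(segment_feats, dim=-1)   # (T, D) per-segment video features
    scores = s @ q                           # (T,) similarity per segment
    top = scores.topk(k=min(top_k, scores.numel()))
    return top.indices, scores

# Toy usage: in practice the embeddings come from a text/video encoder pair.
idx, scores = mine_class_segments(torch.randn(512), torch.randn(100, 512))
print(idx.tolist())
```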

Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint

1 code implementation • 25 Apr 2023 • Guozhang Li, De Cheng, Xinpeng Ding, Nannan Wang, Jie Li, Xinbo Gao

The proposed Bi-SCC first applies a temporal context augmentation to generate an augmented video that breaks the inter-video correlation between positive actions and their co-scene actions; a semantic consistency constraint (SCC) then enforces consistent predictions between the original and augmented videos, thereby suppressing the co-scene actions.

Weakly-supervised Temporal Action Localization · Weakly Supervised Temporal Action Localization
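The consistency part of Bi-SCC described above can be illustrated with a small loss term. This is a sketch only, assuming the constraint compares per-snippet class predictions of the original and augmented videos with a KL divergence; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(orig_logits: torch.Tensor,
                              aug_logits: torch.Tensor) -> torch.Tensor:
    """Illustrative semantic consistency constraint: force the per-snippet class
    predictions of the augmented video to match those of the original video."""
    # logits: (T, C) temporal class activation sequences over T snippets
    p_orig = F.softmax(orig_logits, dim=-1)
    log_p_aug = F.log_softmax(aug_logits, dim=-1)
    return F.kl_div(log_p_aug, p_orig, reduction="batchmean")

loss = semantic_consistency_loss(torch.randn(64, 20), torch.randn(64, 20))
print(loss.item())
```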

Cyclical Self-Supervision for Semi-Supervised Ejection Fraction Prediction from Echocardiogram Videos

1 code implementation • 20 Oct 2022 • Weihang Dai, Xiaomeng Li, Xinpeng Ding, Kwang-Ting Cheng

We also introduce teacher-student distillation to distill the information from LV segmentation masks into an end-to-end LVEF regression model that only requires video inputs.

LV Segmentation · regression +2
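The teacher-student distillation mentioned above can be sketched as a two-term loss: supervised EF regression where labels exist, plus a term pushing the video-only student to mimic representations from a segmentation-aware teacher. The function name, feature shapes, and loss weighting below are assumptions for illustration, not the authors' training code.

```python
from typing import Optional

import torch
import torch.nn.functional as F

def lvef_distillation_loss(student_pred: torch.Tensor,
                           student_feats: torch.Tensor,
                           teacher_feats: torch.Tensor,
                           ef_labels: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Illustrative teacher-student objective: the student regresses LVEF from
    video only, while matching features derived from a teacher that sees LV
    segmentation masks."""
    distill = F.mse_loss(student_feats, teacher_feats.detach())  # mimic the teacher
    if ef_labels is None:                                        # unlabelled clip
        return distill
    supervised = F.mse_loss(student_pred, ef_labels)             # labelled clip
    return supervised + 0.5 * distill

loss = lvef_distillation_loss(torch.randn(4, 1), torch.randn(4, 128),
                              torch.randn(4, 128), torch.rand(4, 1) * 100)
print(loss.item())
```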

Learning Shadow Correspondence for Video Shadow Detection

no code implementations • 30 Jul 2022 • Xinpeng Ding, Jingweng Yang, Xiaowei Hu, Xiaomeng Li

We further design a new evaluation metric to evaluate the temporal stability of the video shadow detection results.

Shadow Detection
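The entry above introduces a metric for the temporal stability of video shadow detection but does not spell it out here. Purely as an illustration of what a stability score can measure (this is not the paper's metric), one could compute the mean IoU between predicted shadow masks of consecutive frames:

```python
import torch

def temporal_stability(masks: torch.Tensor, eps: float = 1e-6) -> float:
    """Hypothetical stability score: mean IoU between the predicted shadow masks
    of consecutive frames. A flickering predictor scores low, a temporally
    consistent one scores high."""
    # masks: (T, H, W) binary predictions over T frames
    prev, curr = masks[:-1].bool(), masks[1:].bool()
    inter = (prev & curr).flatten(1).sum(dim=1).float()
    union = (prev | curr).flatten(1).sum(dim=1).float()
    return ((inter + eps) / (union + eps)).mean().item()

print(temporal_stability(torch.randint(0, 2, (10, 64, 64))))
```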

Free Lunch for Surgical Video Understanding by Distilling Self-Supervisions

1 code implementation • 19 May 2022 • Xinpeng Ding, Ziwei Liu, Xiaomeng Li

Our key insight is to distill knowledge from publicly available models trained on large generic datasets to facilitate the self-supervised learning of surgical videos.

Contrastive Learning · Self-Supervised Learning +2
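One way to read the key insight above is as a joint objective: standard contrastive self-supervision on surgical frames plus a distillation term toward a frozen, generically pretrained model. The sketch below assumes that reading; the weighting and feature choices are illustrative, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def ssl_with_distillation(z1: torch.Tensor, z2: torch.Tensor,
                          student_feats: torch.Tensor,
                          public_feats: torch.Tensor,
                          temperature: float = 0.1,
                          alpha: float = 1.0) -> torch.Tensor:
    """Illustrative combination of (a) an InfoNCE loss between two augmented
    views of the same surgical frames and (b) a distillation term matching the
    student's features to those of a frozen, publicly available model."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0))
    contrastive = F.cross_entropy(logits, targets)       # self-supervision on surgical data
    distill = 1 - F.cosine_similarity(student_feats, public_feats.detach()).mean()
    return contrastive + alpha * distill

loss = ssl_with_distillation(torch.randn(16, 128), torch.randn(16, 128),
                             torch.randn(16, 256), torch.randn(16, 256))
print(loss.item())
```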

Less is More: Surgical Phase Recognition from Timestamp Supervision

1 code implementation • 16 Feb 2022 • Xinpeng Ding, Xinjian Yan, Zixun Wang, Wei Zhao, Jian Zhuang, Xiaowei Xu, Xiaomeng Li

Our study uncovers unique insights into surgical phase recognition with timestamp supervision: 1) timestamp annotation reduces annotation time by 74% compared with full annotation, and surgeons tend to place timestamps near the middle of phases; 2) extensive experiments demonstrate that our method achieves results competitive with fully supervised methods while reducing manual annotation cost; 3) less is more in surgical phase recognition, i.e., fewer but discriminative pseudo labels outperform full labels that contain ambiguous frames; 4) the proposed UATD can be used as a plug-and-play method to clean ambiguous labels near phase boundaries and improve the performance of current surgical phase recognition methods.

Surgical phase recognition

Support-Set Based Cross-Supervision for Video Grounding

no code implementations • ICCV 2021 • Xinpeng Ding, Nannan Wang, Shiwei Zhang, De Cheng, Xiaomeng Li, Ziyuan Huang, Mingqian Tang, Xinbo Gao

The contrastive objective aims to learn effective representations by contrastive learning, while the caption objective can train a powerful video encoder supervised by texts.

Contrastive Learning · Video Grounding
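The two objectives described above (contrastive and captioning) can be combined into a single training loss. The sketch below assumes an InfoNCE term over matched video-text pairs plus a token-level cross-entropy term for the caption decoder; the inputs are random placeholders and the weighting is not taken from the paper.

```python
import torch
import torch.nn.functional as F

def cross_supervision_loss(video_feats: torch.Tensor,
                           text_feats: torch.Tensor,
                           caption_logits: torch.Tensor,
                           caption_tokens: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Illustrative pairing of the two objectives: a video-text contrastive term
    (InfoNCE over matched pairs) plus a captioning term (cross-entropy of a
    decoder predicting the query tokens from the video)."""
    v = F.normalize(video_feats, dim=-1)   # (B, D)
    t = F.normalize(text_feats, dim=-1)    # (B, D)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0))
    contrastive = F.cross_entropy(logits, targets)
    # caption_logits: (B, L, V) decoder outputs, caption_tokens: (B, L) targets
    caption = F.cross_entropy(caption_logits.flatten(0, 1), caption_tokens.flatten())
    return contrastive + caption

loss = cross_supervision_loss(torch.randn(8, 256), torch.randn(8, 256),
                              torch.randn(8, 12, 1000), torch.randint(0, 1000, (8, 12)))
print(loss.item())
```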
