Search Results for author: Wentao Bao

Found 20 papers, 12 papers with code

Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction

1 code implementation 10 Apr 2025 Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Xieyuanli Chen, Hesheng Wang

In addition, these models overlook the synergy between hand movements and headset camera egomotion, either predicting hand trajectories in isolation or encoding egomotion only from past frames.

Denoising Mamba +1

Window Token Concatenation for Efficient Visual Large Language Models

1 code implementation 5 Apr 2025 YiFan Li, Wentao Bao, Botao Ye, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong

To further enhance the performance on fine-grained visual understanding tasks, we introduce WiCo+, which decomposes the visual tokens in later layers of the LLM.

Token Reduction

Visual Large Language Models for Generalized and Specialized Applications

1 code implementation 6 Jan 2025 YiFan Li, Zhixin Lai, Wentao Bao, Zhen Tan, Anh Dao, Kewei Sui, Jiayi Shen, Dong Liu, Huan Liu, Yu Kong

Visual-language models (VLM) have emerged as a powerful tool for learning a unified embedding space for vision and language.

Ethics

Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection

1 code implementation 17 Nov 2024 Wentao Bao, Kai Li, Yuxiao Chen, Deep Patel, Martin Renqiang Min, Yu Kong

Existing approaches focus on the closed-set setting where an action detector is trained and tested on videos from a fixed set of action categories.

Action Detection Open Vocabulary Action Detection

Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

no code implementations 22 Sep 2024 Yuxiao Chen, Kai Li, Wentao Bao, Deep Patel, Yu Kong, Martin Renqiang Min, Dimitris N. Metaxas

Learning to localize temporal boundaries of procedure steps in instructional videos is challenging due to the limited availability of annotated large-scale training videos.

Contrastive Learning cross-modal alignment +4

MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos

no code implementations 4 Sep 2024 Junyi Ma, Xieyuanli Chen, Wentao Bao, Jingyi Xu, Hesheng Wang

Understanding human intentions and actions through egocentric videos is important on the path to embodied artificial intelligence.

Denoising Mamba +2

Facial Affective Behavior Analysis with Instruction Tuning

1 code implementation 7 Apr 2024 YiFan Li, Anh Dao, Wentao Bao, Zhen Tan, Tianlong Chen, Huan Liu, Yu Kong

Our initiative on the dataset and benchmarks reveal the nature and rationale of facial affective behaviors, i.e., fine-grained facial movement, interpretability, and reasoning.

Emotion Recognition Instruction Following

Latent Space Energy-based Model for Fine-grained Open Set Recognition

no code implementations 19 Sep 2023 Wentao Bao, Qi Yu, Yu Kong

A recent trend in OSR shows the benefit of generative models to discriminative unknown detection.

Attribute Density Estimation +1

On Model Explanations with Transferable Neural Pathways

no code implementations 18 Sep 2023 Xinmiao Lin, Wentao Bao, Qi Yu, Yu Kong

Neural pathways as model explanations consist of a sparse set of neurons that provide the same level of prediction performance as the whole model.

model

Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting

1 code implementation ICCV 2023 Wentao Bao, Lele Chen, Libing Zeng, Zhong Li, Yi Xu, Junsong Yuan, Yu Kong

In this paper, we set up an egocentric 3D hand trajectory forecasting task that aims to predict hand trajectories in a 3D space from early observed RGB videos in a first-person view.

3D Human Pose Tracking Trajectory Forecasting +1

Prompting Language-Informed Distribution for Compositional Zero-Shot Learning

1 code implementation 23 May 2023 Wentao Bao, Lichang Chen, Heng Huang, Yu Kong

Compositional zero-shot learning (CZSL) task aims to recognize unseen compositional visual concepts, e.g., sliced tomatoes, where the model is learned only from the seen compositions, e.g., sliced potatoes and red tomatoes.

Compositional Zero-Shot Learning Informativeness +1

Towards Open Set Video Anomaly Detection

no code implementations 23 Aug 2022 Yuansheng Zhu, Wentao Bao, Qi Yu

We develop a novel weakly supervised method for the OpenVAD problem by integrating evidential deep learning (EDL) and normalizing flows (NFs) into a multiple instance learning (MIL) framework.

Anomaly Detection Multiple Instance Learning +2

OpenTAL: Towards Open Set Temporal Action Localization

1 code implementation CVPR 2022 Wentao Bao, Qi Yu, Yu Kong

The OpenTAL is general to enable existing TAL models for open set scenarios, and experimental results on THUMOS14 and ActivityNet1.3 benchmarks show the effectiveness of our method.

Action Classification Classification +2

Gradient Frequency Modulation for Visually Explaining Video Understanding Models

no code implementations 1 Nov 2021 Xinmiao Lin, Wentao Bao, Matthew Wright, Yu Kong

In many applications, it is essential to understand why a machine learning model makes the decisions it does, but this is inhibited by the black-box nature of state-of-the-art neural networks.

Action Recognition Temporal Action Localization +1

Evidential Deep Learning for Open Set Action Recognition

2 code implementations ICCV 2021 Wentao Bao, Qi Yu, Yu Kong

Different from image data, video actions are more challenging to be recognized in an open-set setting due to the uncertain temporal dynamics and static bias of human actions.

Deep Learning Open Set Action Recognition +2

DRIVE: Deep Reinforced Accident Anticipation with Visual Explanation

1 code implementation ICCV 2021 Wentao Bao, Qi Yu, Yu Kong

Traffic accident anticipation aims to accurately and promptly predict the occurrence of a future accident from dashcam videos, which is vital for a safety-guaranteed self-driving system.

Accident Anticipation Decision Making

Group Activity Prediction with Sequential Relational Anticipation Model

1 code implementation ECCV 2020 Junwen Chen, Wentao Bao, Yu Kong

Our model explicitly anticipates both activity features and positions by two graph auto-encoders, aiming to learn a discriminative group representation for group activity prediction.

Activity Prediction model +1

Uncertainty-based Traffic Accident Anticipation with Spatio-Temporal Relational Learning

2 code implementations 1 Aug 2020 Wentao Bao, Qi Yu, Yu Kong

The derived uncertainty-based ranking loss is found to significantly boost model performance by improving the quality of relational features.

Accident Anticipation Activity Prediction +4

Object-Aware Centroid Voting for Monocular 3D Object Detection

no code implementations 20 Jul 2020 Wentao Bao, Qi Yu, Yu Kong

Monocular 3D object detection aims to detect objects in a 3D physical world from a single camera.

Depth Estimation Monocular 3D Object Detection +3
