Search Results for author: Zhaoxin Fan

Found 55 papers, 17 papers with code

SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting

no code implementations17 Jun 2025 Ziqiao Peng, Wentao Hu, Junyuan Ma, Xiangyu Zhu, Xiaomei Zhang, Hao Zhao, Hui Tian, Jun He, Hongyan Liu, Zhaoxin Fan

A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses.

HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment

no code implementations26 May 2025 Ming Meng, Qi Dong, Jiajie Li, Zhe Zhu, Xingyu Wang, Zhaoxin Fan, Wei Zhao, Wenjun Wu

Virtual try-on technology has become increasingly important in the fashion and retail industries, enabling the generation of high-fidelity garment images that adapt seamlessly to target human models.

Virtual Try-on

DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations

no code implementations CVPR 2025 Ziqiao Peng, Yanbo Fan, HaoYu Wu, Xuan Wang, Hongyan Liu, Jun He, Zhaoxin Fan

To address this issue, we propose a new task -- multi-round dual-speaker interaction for 3D talking head generation -- which requires models to handle and generate both speaking and listening behaviors in continuous conversation.

Talking Head Generation

BitHydra: Towards Bit-flip Inference Cost Attack against Large Language Models

no code implementations22 May 2025 Xiaobei Yan, Yiming Li, Zhaoxin Fan, Han Qiu, Tianwei Zhang

Large language models (LLMs) have shown impressive capabilities across a wide range of applications, but their ever-increasing size and resource demands make them vulnerable to inference cost attacks, where attackers induce victim LLMs to generate the longest possible output content.

AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars

no code implementations21 May 2025 Tianbao Zhang, Jian Zhao, Yuer Li, Zheng Zhu, Ping Hu, Zhaoxin Fan, Wenjun Wu, Xuelong Li

Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans and enhancing the capabilities of interactive virtual agents, with wide-ranging applications in virtual reality, digital entertainment, and remote communication.

TinyAlign: Boosting Lightweight Vision-Language Models by Mitigating Modal Alignment Bottlenecks

no code implementations19 May 2025 Yuanze Hu, Zhaoxin Fan, Xinyu Wang, Gen Li, Ye Qiu, Zhichao Yang, Wenjun Wu, Kejian Wu, Yifan Sun, Xiaotie Deng, Jin Dong

Our work thus offers a practical pathway for developing more capable lightweight VLMs while introducing a fresh theoretical lens to better understand and address alignment bottlenecks in constrained multimodal systems.

Language Modeling Language Modelling +1

Black-box Adversaries from Latent Space: Unnoticeable Attacks on Human Pose and Shape Estimation

no code implementations17 May 2025 Zhiying Li, GuangGang Geng, Yeying Jin, Zhizhi Guo, Bruce Gu, Jidong Huo, Zhaoxin Fan, Wenjun Wu

These findings underscore the urgent need to address and mitigate security risks associated with digital human generation systems.

Pose Estimation

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

no code implementations22 Apr 2025 Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, Liang Lin, Zhihao Xu, Haolang Lu, Xinye Cao, Xinyun Zhou, Weifei Jin, Fanci Meng, Shicheng Xu, Junyuan Mao, Yu Wang, Hao Wu, Minghe Wang, Fan Zhang, Junfeng Fang, Wenjie Qu, Yue Liu, Chengwei Liu, Yifan Zhang, Qiankun Li, Chongye Guo, Yalan Qin, Zhaoxin Fan, Kai Wang, Yi Ding, Donghai Hong, Jiaming Ji, Yingxin Lai, Zitong Yu, Xinfeng Li, Yifan Jiang, Yanhui Li, Xinyu Deng, Junlin Wu, Dongxia Wang, Yihao Huang, Yufei Guo, Jen-tse Huang, Qiufeng Wang, Xiaolong Jin, Wenxuan Wang, Dongrui Liu, Yanwei Yue, Wenke Huang, Guancheng Wan, Heng Chang, Tianlin Li, Yi Yu, Chenghao Li, Jiawei Li, Lei Bai, Jie Zhang, Qing Guo, Jingyi Wang, Tianlong Chen, Joey Tianyi Zhou, Xiaojun Jia, Weisong Sun, Cong Wu, Jing Chen, Xuming Hu, Yiming Li, Xiao Wang, Ningyu Zhang, Luu Anh Tuan, Guowen Xu, Jiaheng Zhang, Tianwei Zhang, Xingjun Ma, Jindong Gu, Liang Pang, Xiang Wang, Bo An, Jun Sun, Mohit Bansal, Shirui Pan, Lingjuan Lyu, Yuval Elovici, Bhavya Kailkhura, Yaodong Yang, Hongwei Li, Wenyuan Xu, Yizhou Sun, Wei Wang, Qing Li, Ke Tang, Yu-Gang Jiang, Felix Juefei-Xu, Hui Xiong, XiaoFeng Wang, DaCheng Tao, Philip S. Yu, Qingsong Wen, Yang Liu

Currently, existing surveys on LLM safety primarily focus on specific stages of the LLM lifecycle, e.g., the deployment phase or fine-tuning phase, lacking a comprehensive understanding of the entire "lifechain" of LLMs.

Model Editing

Unicorn: Text-Only Data Synthesis for Vision Language Model Training

1 code implementation28 Mar 2025 Xiaomin Yu, Pengxiang Ding, Wenjie Zhang, Siteng Huang, Songyang Gao, Chengwei Qin, Kejian Wu, Zhaoxin Fan, Ziyue Qiao, Donglin Wang

By eliminating the dependency on real images while maintaining data quality and diversity, our framework offers a cost-effective and scalable solution for VLM training.

Language Modeling Language Modelling

STAMICS: Splat, Track And Map with Integrated Consistency and Semantics for Dense RGB-D SLAM

no code implementations27 Mar 2025 Yongxu Wang, Xu Cao, Weiyun Yi, Zhaoxin Fan

Simultaneous Localization and Mapping (SLAM) is a critical task in robotics, enabling systems to autonomously navigate and understand complex environments.

Camera Pose Estimation Navigate +2

MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

1 code implementation12 Mar 2025 Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline.

Chunking Computational Efficiency +3
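As a rough illustration of the chunking step that MoC revisits, a naive fixed-size chunker with character overlap (a common RAG baseline, not the paper's learned chunking approach) might look like:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size character chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
    return chunks
```

Fixed-size chunking like this ignores sentence and topic boundaries, which is exactly the weakness that motivates learned chunkers.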

ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis

no code implementations9 Mar 2025 Xukun Zhou, Fengxin Li, Ming Chen, Yan Zhou, Pengfei Wan, Di Zhang, Hongyan Liu, Jun He, Zhaoxin Fan

Audio-driven human gesture synthesis is a crucial task with broad applications in virtual avatars, human-computer interaction, and creative content generation.

Diversity Retrieval

DH-RAG: A Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue

no code implementations19 Feb 2025 Feiyuan Zhang, Dezhi Zhu, James Ming, Yilun Jin, Di Chai, Liu Yang, Han Tian, Zhaoxin Fan, Kai Chen

Retrieval-Augmented Generation (RAG) systems have shown substantial benefits in applications such as question answering and multi-turn dialogue (Lewis et al., 2020).

Question Answering RAG +2

VarGes: Improving Variation in Co-Speech 3D Gesture Generation via StyleCLIPS

1 code implementation15 Feb 2025 Ming Meng, Ke Mu, Yonggui Zhu, Zhe Zhu, Haoyu Sun, Heyang Yan, Zhaoxin Fan

Generating expressive and diverse human gestures from audio is crucial in fields like human-computer interaction, virtual reality, and animation.

3D Human Pose Estimation Diversity +1

TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding

1 code implementation26 Jan 2025 Xingjian Zhang, Xi Weng, Yihao Yue, Zhaoxin Fan, Wenjun Wu, Lei Huang

We present TinyLLaVA-Video, a video understanding model with no more than 4B parameters that processes video sequences in a simple manner, without the need for complex architectures, and supports both fps sampling and uniform frame sampling.

Video Understanding
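Uniform frame sampling, one of the two sampling modes mentioned in the abstract, can be sketched as evenly spaced index selection (an illustrative simplification, not the released implementation):

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across a video."""
    if num_samples >= total_frames:
        return list(range(total_frames))  # fewer frames than requested samples
    # Center each sampled index within its equal-length segment of the video.
    seg = total_frames / num_samples
    return [int(seg * (i + 0.5)) for i in range(num_samples)]
```

For a 100-frame clip and 4 samples this yields indices [12, 37, 62, 87], i.e., the midpoints of four equal segments; fps sampling would instead pick indices at a fixed time interval.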

JTD-UAV: MLLM-Enhanced Joint Tracking and Description Framework for Anti-UAV Systems

no code implementations CVPR 2025 Yifan Wang, Jian Zhao, Zhaoxin Fan, Xin Zhang, Xuecheng Wu, Yudian Zhang, Lei Jin, Xinyue Li, Gang Wang, Mengxi Jia, Ping Hu, Zheng Zhu, Xuelong Li

To benchmark this task, we introduce the TDUAV dataset, the largest dataset for joint UAV tracking and intent understanding, featuring 1,328 challenging video sequences, over 163K annotated thermal frames, and 3K VQA pairs.

Question Answering Visual Question Answering

EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers

1 code implementation29 Dec 2024 Daiheng Gao, Shilin Lu, Shaw Walters, Wenbo Zhou, Jiaming Chu, Jie Zhang, Bang Zhang, Mengxi Jia, Jian Zhao, Zhaoxin Fan, Weiming Zhang

Removing unwanted concepts from large-scale text-to-image (T2I) diffusion models while maintaining their overall generative quality remains an open challenge.

Contrastive Learning

MambaVO: Deep Visual Odometry Based on Sequential Matching Refinement and Training Smoothing

no code implementations CVPR 2025 Shuo Wang, Wanting Li, Yongcai Wang, Zhaoxin Fan, Zhe Huang, Xudong Cai, Jian Zhao, Deying Li

To address this challenge, this paper proposes MambaVO, which conducts robust initialization, Mamba-based sequential matching refinement, and smoothed training to enhance the matching quality and improve the pose estimation in deep visual odometry.

Mamba Pose Estimation +1

Dust to Tower: Coarse-to-Fine Photo-Realistic Scene Reconstruction from Sparse Uncalibrated Images

no code implementations27 Dec 2024 Xudong Cai, Yongcai Wang, Zhaoxin Fan, Deng Haoran, Shuo Wang, Wanting Li, Deying Li, Lun Luo, Minhang Wang, Jintao Xu

To refine the 3D model at novel viewpoints, we propose a Confidence Aware Depth Alignment (CADA) module that refines the coarse depth maps by aligning their confident parts with depths estimated by a mono-depth model.

3DGS Novel View Synthesis +1
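The depth-alignment idea behind CADA can be illustrated as a least-squares scale-and-shift fit on confident pixels (a hypothetical simplification for intuition only, not the authors' method; `align_depth` and its argument layout are invented for this sketch):

```python
def align_depth(coarse, mono, confident):
    """Fit coarse ≈ s * mono + t by least squares on confident pixel
    indices, then apply the fitted scale/shift to the whole mono-depth map."""
    xs = [mono[i] for i in confident]
    ys = [coarse[i] for i in confident]
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Closed-form 1D linear regression: slope s and intercept t.
    s = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    t = (sy - s * sx) / n
    return [s * m + t for m in mono]
```

The key point is that only the confident regions of the coarse depth drive the fit, while the mono-depth prior fills in everywhere else.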

Score and Distribution Matching Policy: Advanced Accelerated Visuomotor Policies via Matched Distillation

no code implementations12 Dec 2024 Bofang Jia, Pengxiang Ding, Can Cui, Mingyang Sun, Pengfang Qian, Siteng Huang, Zhaoxin Fan, Donglin Wang

Visual-motor policy learning has advanced with architectures like diffusion-based policies, known for modeling complex robotic trajectories.

Moderating the Generalization of Score-based Generative Model

no code implementations10 Dec 2024 Wan Jiang, He Wang, Xin Zhang, Dan Guo, Zhaoxin Fan, Yunfeng Diao, Richang Hong

To fill this gap, we first examine the current 'gold standard' in Machine Unlearning (MU), i.e., re-training the model after removing the undesirable training data, and find it does not work in SGMs.

Image Inpainting Machine Unlearning +1
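The retraining baseline that the abstract calls the 'gold standard' of Machine Unlearning amounts to dropping the unwanted data and training a fresh model on the remainder; a schematic sketch (with a hypothetical `train_fn` standing in for the actual training loop):

```python
def retrain_unlearn(dataset, forget, train_fn):
    """'Gold standard' unlearning baseline: remove the data to forget,
    then train a new model from scratch on the retained examples."""
    retained = [ex for ex in dataset if ex not in forget]
    return train_fn(retained)
```

The paper's observation is that even this exact baseline fails for score-based generative models, which is what motivates their alternative.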

CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction

no code implementations9 Dec 2024 Zhefei Gong, Pengxiang Ding, Shangke Lyu, Siteng Huang, Mingyang Sun, Wei Zhao, Zhaoxin Fan, Donglin Wang

In this paper, we introduce Coarse-to-Fine AutoRegressive Policy (CARP), a novel paradigm for visuomotor policy learning that redefines the autoregressive action generation process as a coarse-to-fine, next-scale approach.

Action Generation Denoising

VGG-Tex: A Vivid Geometry-Guided Facial Texture Estimation Model for High Fidelity Monocular 3D Face Reconstruction

no code implementations15 Sep 2024 HaoYu Wu, Ziqiao Peng, Xukun Zhou, Yunfei Cheng, Jun He, Hongyan Liu, Zhaoxin Fan

Specifically, VGG-Tex includes a Facial Attributes Encoding Module, a Geometry-Guided Texture Generator, and a Visibility-Enhanced Texture Completion Module.

3D Face Reconstruction

EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face Animation

no code implementations21 Aug 2024 Yihong Lin, Liang Peng, Zhaoxin Fan, Xianjia Wu, Jianqiao Hu, Xiandong Li, Wenxiong Kang, Songju Lei

EmoFace employs a novel Mesh Attention mechanism to analyse and fuse the emotion features and content features.

3D Face Animation

GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

no code implementations3 Aug 2024 Yihong Lin, Zhaoxin Fan, Xianjia Wu, Lingyu Xiong, Liang Peng, Xiandong Li, Wenxiong Kang, Songju Lei, Huang Xu

Speech-driven talking head generation is a critical yet challenging task with applications in augmented reality and virtual human modeling.

Diversity Talking Head Generation

MLPHand: Real Time Multi-View 3D Hand Mesh Reconstruction via MLP Modeling

no code implementations23 Jun 2024 Jian Yang, Jiakun Li, Guoming Li, Zhen Shen, Huai-Yu Wu, Zhaoxin Fan, Heng Huang

Multi-view hand mesh reconstruction is a critical task for applications in virtual reality and human-computer interaction, but it remains a formidable challenge.

A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing

no code implementations15 Jun 2024 Ming Meng, Yufei Zhao, Bo Zhang, Yonggui Zhu, Weimin Shi, Maxwell Wen, Zhaoxin Fan

Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality and game production.

Idea23D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs

1 code implementation5 Apr 2024 JunHao Chen, Xiang Li, Xiaojun Ye, Chao Li, Zhaoxin Fan, Hao Zhao

Recently, this success has been extended to 3D AIGC, with state-of-the-art methods generating textured 3D models from single images or text.

3D Generation Image to 3D +1

Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail

1 code implementation18 Mar 2024 Mingjin Chen, JunHao Chen, Xiaojun Ye, Huan-ang Gao, Xiaoxue Chen, Zhaoxin Fan, Hao Zhao

In this paper, we propose a new method called \emph{Ultraman} for fast reconstruction of textured 3D human models from a single image.

Lifelike 3D Human Generation

AS-FIBA: Adaptive Selective Frequency-Injection for Backdoor Attack on Deep Face Restoration

no code implementations11 Mar 2024 Zhenbo Song, Wenhao Gao, Kaihao Zhang, Wenhan Luo, Zhaoxin Fan, Jianfeng Lu

Extensive experiments demonstrate the efficacy of the degradation objective on state-of-the-art face restoration models.

Backdoor Attack

Adversarial Purification and Fine-tuning for Robust UDC Image Restoration

no code implementations21 Feb 2024 Zhenbo Song, Zhenyuan Zhang, Kaihao Zhang, Zhaoxin Fan, Jianfeng Lu

This study delves into the enhancement of Under-Display Camera (UDC) image restoration models, focusing on their robustness against adversarial attacks.

Adversarial Purification Image Restoration

BeatDance: A Beat-Based Model-Agnostic Contrastive Learning Framework for Music-Dance Retrieval

no code implementations16 Oct 2023 Kaixing Yang, Xukun Zhou, Xulong Tang, Ran Diao, Hongyan Liu, Jun He, Zhaoxin Fan

Dance and music are closely related forms of expression, with mutual retrieval between dance videos and music being a fundamental task in various fields like education, art, and sports.

Contrastive Learning Retrieval

Multi-dimensional Fusion and Consistency for Semi-supervised Medical Image Segmentation

no code implementations12 Sep 2023 Yixing Lu, Zhaoxin Fan, Min Xu

In this paper, we introduce a novel semi-supervised learning framework tailored for medical image segmentation.

Image Segmentation Semantic Segmentation +1

D-IF: Uncertainty-aware Human Digitization via Implicit Distribution Field

1 code implementation ICCV 2023 Xueting Yang, Yihao Luo, Yuliang Xiu, Wei Wang, Hao Xu, Zhaoxin Fan

In this paper, we propose replacing the implicit value with an adaptive uncertainty distribution, to differentiate between points based on their distance to the surface.

Benchmarking Ultra-High-Definition Image Reflection Removal

1 code implementation1 Aug 2023 Zhenyuan Zhang, Zhenbo Song, Kaihao Zhang, Zhaoxin Fan, Jianfeng Lu

To the best of our knowledge, these two datasets are the first large-scale UHD datasets for SIRR.

Benchmarking Image Restoration +1

SelfTalk: A Self-Supervised Commutative Training Diagram to Comprehend 3D Talking Faces

1 code implementation19 Jun 2023 Ziqiao Peng, Yihao Luo, Yue Shi, Hao Xu, Xiangyu Zhu, Jun He, Hongyan Liu, Zhaoxin Fan

To enhance the visual accuracy of generated lip movement while reducing the dependence on labeled data, we propose a novel framework, SelfTalk, which involves self-supervision in a cross-modal network system to learn 3D talking faces.

3D Face Animation Lip Reading

EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation

2 code implementations ICCV 2023 Ziqiao Peng, HaoYu Wu, Zhenbo Song, Hao Xu, Xiangyu Zhu, Jun He, Hongyan Liu, Zhaoxin Fan

Specifically, we introduce the emotion disentangling encoder (EDE) to disentangle the emotion and content in the speech by cross-reconstructed speech signals with different emotion labels.

3D Face Animation Decoder +1

SHLE: Devices Tracking and Depth Filtering for Stereo-based Height Limit Estimation

1 code implementation22 Dec 2022 Zhaoxin Fan, Kaixing Yang, Min Zhang, Zhenbo Song, Hongyan Liu, Jun He

In stage 1, a novel device detection and tracking scheme is introduced, which accurately locates the height limit devices in the left or right image.

FuRPE: Learning Full-body Reconstruction from Part Experts

1 code implementation30 Nov 2022 Zhaoxin Fan, Yuqing Pan, Hao Xu, Zhenbo Song, Zhicheng Wang, Kejian Wu, Hongyan Liu, Jun He

These novel elements of FuRPE not only serve to further refine the model but also to reduce potential biases that may arise from inaccuracies in pseudo labels, thereby optimizing the network's training process and enhancing the robustness of the model.

GIDP: Learning a Good Initialization and Inducing Descriptor Post-enhancing for Large-scale Place Recognition

no code implementations23 Sep 2022 Zhaoxin Fan, Zhenbo Song, Hongyan Liu, Jun He

Large-scale place recognition is a fundamental but challenging task, which plays an increasingly important role in autonomous driving and robotics.

Autonomous Driving Reranking

Human Pose Driven Object Effects Recommendation

no code implementations17 Sep 2022 Zhaoxin Fan, Fengxin Li, Hongyan Liu, Jun He, Xiaoyong Du

In this paper, we study the new topic of object effects recommendation in micro-video platforms, a challenging but important task for many practical applications such as advertisement insertion.

Object

MonoSIM: Simulating Learning Behaviors of Heterogeneous Point Cloud Object Detectors for Monocular 3D Object Detection

1 code implementation19 Aug 2022 Han Sun, Zhaoxin Fan, Zhenbo Song, Zhicheng Wang, Kejian Wu, Jianfeng Lu

The insight behind introducing MonoSIM is that we propose to simulate the feature learning behaviors of a point cloud based detector for a monocular detector during training.

Autonomous Driving Depth Estimation +4

Reconstruction-Aware Prior Distillation for Semi-supervised Point Cloud Completion

no code implementations20 Apr 2022 Zhaoxin Fan, Yulin He, Zhicheng Wang, Kejian Wu, Hongyan Liu, Jun He

Real-world sensors often produce incomplete, irregular, and noisy point clouds, making point cloud completion increasingly important.

Point Cloud Completion

Object Level Depth Reconstruction for Category Level 6D Object Pose Estimation From Monocular RGB Image

no code implementations4 Apr 2022 Zhaoxin Fan, Zhenbo Song, Jian Xu, Zhicheng Wang, Kejian Wu, Hongyan Liu, Jun He

Recently, RGBD-based category-level 6D object pose estimation has achieved promising improvement in performance, however, the requirement of depth information prohibits broader applications.

6D Pose Estimation using RGB Object

RPR-Net: A Point Cloud-based Rotation-aware Large Scale Place Recognition Network

no code implementations29 Aug 2021 Zhaoxin Fan, Zhenbo Song, Wenping Zhang, Hongyan Liu, Jun He, Xiaoyong Du

Third, we apply these kernels to previous point cloud features to generate new features, which is the well-known SO(3) mapping process.

Autonomous Driving Point Cloud Retrieval +2

Deep Learning on Monocular Object Pose Detection and Tracking: A Comprehensive Overview

no code implementations29 May 2021 Zhaoxin Fan, Yazhi Zhu, Yulin He, Qi Sun, Hongyan Liu, Jun He

Therefore, this study presents a comprehensive review of recent progress in object pose detection and tracking that belongs to the deep learning technical route.

Autonomous Driving Deep Learning +2
