Search Results for author: Li Yuan

Found 117 papers, 70 papers with code

GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

1 code implementation • 3 Apr 2025 • Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, Li Yuan

The recent breakthroughs in OpenAI's GPT4o model have demonstrated surprisingly good capabilities in image generation and editing, resulting in significant excitement in the community.

Image Generation • World Knowledge

NeuralGS: Bridging Neural Fields and 3D Gaussian Splatting for Compact 3D Representations

no code implementations • 29 Mar 2025 • Zhenyu Tang, Chaoran Feng, Xinhua Cheng, Wangbo Yu, Junwu Zhang, YuAn Liu, Xiaoxiao Long, Wenping Wang, Li Yuan

The compression performance of our method on original 3DGS is comparable to that of dedicated Scaffold-GS-based compression methods, which demonstrates the huge potential of directly compressing original 3DGS with neural fields.

3DGS • NeRF • +1

LangBridge: Interpreting Image as a Combination of Language Embeddings

no code implementations • 25 Mar 2025 • Jiaqi Liao, Yuwei Niu, Fanqing Meng, Hao Li, Changyao Tian, Yinuo Du, Yuwen Xiong, Dianqi Li, Xizhou Zhu, Li Yuan, Jifeng Dai, Yu Cheng

Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings.

cross-modal alignment

UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation

no code implementations • 19 Mar 2025 • Qihui Zhang, Munan Ning, Zheyuan Liu, Yanbo Wang, Jiayi Ye, Yue Huang, Shuo Yang, Xiao Chen, Yibing Song, Li Yuan

Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models.

Language Modeling • Language Modelling • +4

SwapAnyone: Consistent and Realistic Video Synthesis for Swapping Any Person into Any Video

1 code implementation • 12 Mar 2025 • ChengShu Zhao, Yunyang Ge, Xinhua Cheng, Bin Zhu, Yatian Pang, Bin Lin, Fan Yang, Feng Gao, Li Yuan

Video body-swapping aims to replace the body in an existing video with a new body from arbitrary sources, which has garnered more attention in recent years.

Video Inpainting

UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion

no code implementations • 9 Mar 2025 • Gongbo Zhang, Yanting Li, Renqian Luo, Pipi Hu, Zeru Zhao, Lingbo Li, Guoqing Liu, Zun Wang, Ran Bi, Kaiyuan Gao, Liya Guo, Yu Xie, Chang Liu, Jia Zhang, Tian Xie, Robert Pinsler, Claudio Zeni, Ziheng Lu, Yingce Xia, Marwin Segler, Maik Riechert, Li Yuan, Lei Chen, Haiguang Liu, Tao Qin

We validate the effectiveness of UniGenX on material and small molecule generation tasks, achieving a significant leap in state-of-the-art performance for material crystal structure prediction and establishing new state-of-the-art results for small molecule structure prediction, de novo design, and conditional generation.

Text Generation

Magic 1-For-1: Generating One Minute Video Clips within One Minute

1 code implementation • 11 Feb 2025 • Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou

The key idea is simple: factorize the text-to-video generation task into two separate easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation.

Image to Video Generation • Text-to-Image Generation • +1

GPT as a Monte Carlo Language Tree: A Probabilistic Perspective

no code implementations • 13 Jan 2025 • Kun-Peng Ning, Jia-Yu Yao, Yu-Yang Liu, Mu-Nan Ning, Li Yuan

In this paper, we propose a novel perspective that any language dataset can be represented by a Monte Carlo Language Tree (abbreviated as ``Data-Tree''), where each node denotes a token, each edge denotes a token transition probability, and each sequence has a unique path.

Hallucination
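The Data-Tree described in this abstract is essentially a prefix tree whose nodes are tokens and whose edges carry transition probabilities. A minimal sketch over a toy corpus (an illustrative reconstruction of the idea, not the authors' implementation):

```python
def build_data_tree(corpus):
    """Insert each token sequence along a unique root-to-leaf path,
    counting how often each token follows a given prefix."""
    root = {}
    for sentence in corpus:
        node = root
        for tok in sentence.split():
            entry = node.setdefault(tok, {"count": 0, "children": {}})
            entry["count"] += 1
            node = entry["children"]
    return root

def transition_probs(node):
    """Normalize child counts into edge transition probabilities."""
    total = sum(e["count"] for e in node.values())
    return {tok: e["count"] / total for tok, e in node.items()}

root = build_data_tree(["the cat sat", "the cat ran", "the dog sat"])
probs = transition_probs(root["the"]["children"])
# After "the": "cat" was seen 2 of 3 times, "dog" 1 of 3 times.
```

Sampling a path from the root by repeatedly drawing children according to these probabilities would then mimic autoregressive generation over the dataset.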

AE-NeRF: Augmenting Event-Based Neural Radiance Fields for Non-ideal Conditions and Larger Scene

no code implementations • 6 Jan 2025 • Chaoran Feng, Wangbo Yu, Xinhua Cheng, Zhenyu Tang, Junwu Zhang, Li Yuan, Yonghong Tian

Compared to frame-based methods, computational neuromorphic imaging using event cameras offers significant advantages, such as minimal motion blur, enhanced temporal resolution, and high dynamic range.

3D Reconstruction • NeRF

Hierarchical Banzhaf Interaction for General Video-Language Representation Learning

1 code implementation • 30 Dec 2024 • Peng Jin, Hao Li, Li Yuan, Shuicheng Yan, Jie Chen

As an important subfield, video-language representation learning focuses on learning representations using global semantic interactions between pre-defined video-text pairs.

Contrastive Learning • Question Answering • +4

RoomPainter: View-Integrated Diffusion for Consistent Indoor Scene Texturing

no code implementations • 21 Dec 2024 • Zhipeng Huang, Wangbo Yu, Xinhua Cheng, ChengShu Zhao, Yunyang Ge, Mingyi Guo, Li Yuan, Yonghong Tian

The core of RoomPainter features a zero-shot technique that effectively adapts a 2D diffusion model for 3D-consistent texture synthesis, along with a two-stage generation strategy that ensures both global and local consistency.

Texture Synthesis

Next Patch Prediction for Autoregressive Visual Generation

1 code implementation • 19 Dec 2024 • Yatian Pang, Peng Jin, Shuo Yang, Bin Lin, Bin Zhu, Zhenyu Tang, Liuhan Chen, Francis E. H. Tay, Ser-Nam Lim, Harry Yang, Li Yuan

Autoregressive models, built based on the Next Token Prediction (NTP) paradigm, show great potential in developing a unified framework that integrates both language and vision tasks.

Image Generation • Prediction

Open-Sora Plan: Open-Source Large Video Generation Model

5 code implementations • 28 Nov 2024 • Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, Tanghui Jia, Junwu Zhang, Zhenyu Tang, Yatian Pang, Bin She, Cen Yan, Zhiheng Hu, Xiaoyi Dong, Lin Chen, Zhang Pan, Xing Zhou, Shaoling Dong, Yonghong Tian, Li Yuan

We introduce Open-Sora Plan, an open-source project that aims to contribute a large generation model for generating desired high-resolution videos with long durations based on various user inputs.

Video Generation

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

1 code implementation • 26 Nov 2024 • Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan

We propose a hierarchical training strategy to leverage frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an IPT2V model.

Image to Video Generation • Text-to-Video Generation

WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

2 code implementations • 26 Nov 2024 • Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, Li Yuan

However, as the resolution and duration of generated videos increase, the encoding cost of Video VAEs becomes a limiting bottleneck in training LVDMs.

Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

1 code implementation • 23 Nov 2024 • Zhiyuan Yan, Jiangming Wang, Peng Jin, Ke-Yue Zhang, Chengchun Liu, Shen Chen, Taiping Yao, Shouhong Ding, Baoyuan Wu, Li Yuan

AI-generated images (AIGIs), such as natural or face images, have become increasingly realistic and indistinguishable, making their detection a critical and pressing challenge.

Face Swapping • Synthetic Image Detection

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

2 code implementations • 15 Nov 2024 • Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, Li Yuan

Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1.

Logical Reasoning • Multimodal Reasoning • +2

Sparse Orthogonal Parameters Tuning for Continual Learning

no code implementations • 5 Nov 2024 • Kun-Peng Ning, Hai-Jian Ke, Yu-Yang Liu, Jia-Yu Yao, Yong-Hong Tian, Li Yuan

We hypothesize that the effectiveness of SoTU lies in the transformation of knowledge learned from multiple domains into the fusion of orthogonal delta parameters.

Continual Learning

Efficiently Training Time-to-First-Spike Spiking Neural Networks from Scratch

no code implementations • 31 Oct 2024 • Kaiwei Che, Wei Fang, Zhengyu Ma, Yifan Huang, Peng Xue, Li Yuan, Timothée Masquelier, Yonghong Tian

To address these challenges, we propose a training framework incorporating parameter initialization, training normalization, temporal output decoding, and pooling layer re-evaluation.

Spatial-Temporal Search for Spiking Neural Networks

no code implementations • 24 Oct 2024 • Kaiwei Che, Zhaokun Zhou, Li Yuan, JianGuo Zhang, Yonghong Tian, Luziwei Leng

Drawing inspiration from the heterogeneity of biological neural networks, we propose a differentiable approach to optimize SNN on both spatial and temporal dimensions.

Image Classification • Neural Architecture Search

Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model

no code implementations • 18 Oct 2024 • Li Yuan, Yi Cai, Junsheng Huang

This method can effectively address the problem of insufficient information in the few-shot setting by guiding a large language model to generate supplementary background knowledge.

Language Modeling • Language Modelling • +5

MoH: Multi-Head Attention as Mixture-of-Head Attention

3 code implementations • 15 Oct 2024 • Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan

We show that multi-head attention can be expressed in the summation form.

Mixture-of-Experts
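The summation form mentioned in the abstract is the standard identity that concatenating attention heads and projecting with the output matrix equals summing each head's projection against the corresponding row-block of that matrix. A small numerical check (illustrative only; the head outputs, `W_O`, and dimensions here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, d = 4, 2, 3                                    # tokens, heads, per-head dim
heads = [rng.normal(size=(n, d)) for _ in range(h)]  # stand-ins for per-head attention outputs
W_O = rng.normal(size=(h * d, h * d))                # output projection

# Standard form: concatenate heads along features, then project once.
concat_out = np.concatenate(heads, axis=-1) @ W_O

# Summation form: split W_O row-wise into per-head blocks and sum.
sum_out = sum(head @ W_O[i * d:(i + 1) * d] for i, head in enumerate(heads))

assert np.allclose(concat_out, sum_out)
```

Once attention is written as a sum over heads, gating that sum per head (a mixture over heads) becomes a natural generalization, which is the direction this paper takes.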

Is Parameter Collision Hindering Continual Learning in LLMs?

1 code implementation • 14 Oct 2024 • Shuo Yang, Kun-Peng Ning, Yu-Yang Liu, Jia-Yu Yao, Yong-Hong Tian, Yi-Bing Song, Li Yuan

Large Language Models (LLMs) often suffer from catastrophic forgetting when learning multiple tasks sequentially, making continual learning (CL) essential for their dynamic deployment.

Continual Learning

MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts

3 code implementations • 9 Oct 2024 • Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan

In this work, we aim to simultaneously enhance the effectiveness and efficiency of Mixture-of-Experts (MoE) methods.

Mixture-of-Experts

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

1 code implementation • 3 Sep 2024 • Wangbo Yu, Jinbo Xing, Li Yuan, WenBo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, Yonghong Tian

Our method takes advantage of the powerful generation capabilities of video diffusion model and the coarse 3D clues offered by point-based representation to generate high-quality video frames with precise camera pose control.

3D Generation • 3D Reconstruction • +3

OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model

1 code implementation • 2 Sep 2024 • Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinhua Cheng, Li Yuan

With the same reconstruction quality, the more sufficient the VAE's compression for videos is, the more efficient the LVDMs are.

Video Generation • Video Reconstruction

Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning

no code implementations • 30 Aug 2024 • Zhiyuan Yan, Yandan Zhao, Shen Chen, Mingyi Guo, Xinghe Fu, Taiping Yao, Shouhong Ding, Li Yuan

To reproduce FFD, we then propose a novel Video-level Blending data (VB), where VB is implemented by blending the original image and its warped version frame-by-frame, serving as a hard negative sample to mine more general artifacts.

Face Swapping • Image Forgery Detection

Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle

no code implementations • 28 Jul 2024 • Zhenyu Tang, Junwu Zhang, Xinhua Cheng, Wangbo Yu, Chaoran Feng, Yatian Pang, Bin Lin, Li Yuan

Recent 3D large reconstruction models typically employ a two-stage process: first generating multi-view images with a multi-view diffusion model, and then reconstructing them into 3D content with a feed-forward model. However, multi-view diffusion models often produce low-quality and inconsistent images, adversely affecting the quality of the final 3D reconstruction.

3D Generation • 3D Reconstruction • +3

HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions

no code implementations • 21 Jul 2024 • Haiyang Zhou, Xinhua Cheng, Wangbo Yu, Yonghong Tian, Li Yuan

3D scene generation is in high demand across various domains, including virtual reality, gaming, and the film industry.

Scene Generation

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

no code implementations • 15 Jul 2024 • Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Runyi Yu, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

Specifically, we provide an automated method for reference local action sampling and leverage graph attention networks to assess the guiding weight of each local action in the overall motion synthesis.

Graph Attention • Motion Generation • +1

ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

2 code implementations • 26 Jun 2024 • Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan

We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of T2V models (e.g., Sora and Lumiere) in time-lapse video generation.

Text-to-Video Generation • Video Generation

LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

1 code implementation • 26 Jun 2024 • Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan

Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency.

multimodal interaction

Point Tree Transformer for Point Cloud Registration

no code implementations • 25 Jun 2024 • Meiling Wang, Guangyan Chen, Yi Yang, Li Yuan, Yufeng Yue

To overcome these limitations, we propose the Point Tree Transformer (PTT), a novel transformer-based approach for point cloud registration that efficiently extracts comprehensive local and global features while maintaining linear computational complexity.

Point Cloud Registration

DF40: Toward Next-Generation Deepfake Detection

1 code implementation • 19 Jun 2024 • Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Chengjie Wang, Shouhong Ding, Yunsheng Wu, Li Yuan

In this work, we found the dataset (both train and test) can be the "primary culprit" due to: (1) forgery diversity: Deepfake techniques are commonly referred to as both face forgery and entire image synthesis.

DeepFake Detection • Face Reenactment • +2

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

no code implementations • 6 Jun 2024 • Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

To this end, we meticulously designed a differential video captioning strategy, which is stable, scalable, and efficient for generating captions for videos with arbitrary resolution, aspect ratios, and length.

Video Captioning • Video Generation • +2

EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images

no code implementations • 29 May 2024 • Wangbo Yu, Chaoran Feng, Jiye Tang, Jiashu Yang, Zhenyu Tang, Xu Jia, Yuchao Yang, Li Yuan, Yonghong Tian

Capitalizing on the high temporal resolution and dynamic range offered by the event camera, we leverage the event streams to explicitly model the formation process of motion-blurred images and guide the deblurring reconstruction of 3D-GS.

3D Scene Reconstruction • Deblurring • +1

GraCo: Granularity-Controllable Interactive Segmentation

1 code implementation • CVPR 2024 • Yian Zhao, Kehan Li, Zesen Cheng, Pengchong Qiao, Xiawu Zheng, Rongrong Ji, Chang Liu, Li Yuan, Jie Chen

In this work, we introduce Granularity-Controllable Interactive Segmentation (GraCo), a novel approach that allows precise control of prediction granularity by introducing additional parameters to input.

Interactive Segmentation • Segmentation

Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

2 code implementations • 16 Apr 2024 • Songtao Jiang, Tuo Zheng, Yan Zhang, Yeying Jin, Li Yuan, Zuozhu Liu

Recent advancements in general-purpose or domain-specific multimodal large language models (LLMs) have witnessed remarkable progress for medical decision-making.

Image Classification • Mixture-of-Experts • +2

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

no code implementations • 15 Apr 2024 • Zhaokun Zhou, Qiulin Wang, Bin Lin, Yiwei Su, Rui Chen, Xin Tao, Amin Zheng, Li Yuan, Pengfei Wan, Di Zhang

To further evaluate the IAA capability of MLLMs, we construct the UNIAA-Bench, which consists of three aesthetic levels: Perception, Description, and Assessment.

Language Modelling • Large Language Model

MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators

2 code implementations • 7 Apr 2024 • Shenghai Yuan, Jinfa Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo

Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions.

Text-to-Video Generation • Video Generation

QKFormer: Hierarchical Spiking Transformer using Q-K Attention

2 code implementations • 25 Mar 2024 • Chenlin Zhou, Han Zhang, Zhaokun Zhou, Liutao Yu, Liwei Huang, Xiaopeng Fan, Li Yuan, Zhengyu Ma, Huihui Zhou, Yonghong Tian

ii) We incorporate the hierarchical structure, which significantly benefits the performance of both the brain and artificial neural networks, into spiking transformers to obtain multi-scale spiking representation.

Mora: Enabling Generalist Video Generation via A Multi-Agent Framework

1 code implementation • 20 Mar 2024 • Zhengqing Yuan, Yixin Liu, Yihan Cao, Weixiang Sun, Haolong Jia, Ruoxi Chen, Zhaoxu Li, Bin Lin, Li Yuan, Lifang He, Chi Wang, Yanfang Ye, Lichao Sun

Existing open-source methods struggle to achieve comparable performance, often hindered by ineffective agent collaboration and inadequate training data quality.

Image to Video Generation • Text-to-Video Generation • +1

Envision3D: One Image to 3D with Anchor Views Interpolation

1 code implementation • 13 Mar 2024 • Yatian Pang, Tanghui Jia, Yujun Shi, Zhenyu Tang, Junwu Zhang, Xinhua Cheng, Xing Zhou, Francis E. H. Tay, Li Yuan

To address this issue, we propose a novel cascade diffusion framework, which decomposes the challenging dense views generation task into two tractable stages, namely anchor views generation and anchor views interpolation.

Image to 3D

A Logical Pattern Memory Pre-trained Model for Entailment Tree Generation

1 code implementation • 11 Mar 2024 • Li Yuan, Yi Cai, Haopeng Ren, Jiexin Wang

LMPM incorporates an external memory structure to learn and store the latent representations of logical patterns, which aids in generating logically consistent conclusions.

Optimal ANN-SNN Conversion with Group Neurons

1 code implementation • 29 Feb 2024 • Liuzhenghao Lv, Wei Fang, Li Yuan, Yonghong Tian

For instance, while converting artificial neural networks (ANNs) to SNNs circumvents the need for direct training of SNNs, it encounters issues related to conversion errors and high inference time delays.

LLMBind: A Unified Modality-Task Integration Framework

1 code implementation • 22 Feb 2024 • Bin Zhu, Munan Ning, Peng Jin, Bin Lin, Jinfa Huang, Qi Song, Junwu Zhang, Zhenyu Tang, Mingjun Pan, Xing Zhou, Li Yuan

In the multi-modal domain, the dependence of various models on specific input formats leads to user confusion and hinders progress.

AI Agent • Audio Generation • +5

PiCO: Peer Review in LLMs based on the Consistency Optimization

1 code implementation • 2 Feb 2024 • Kun-Peng Ning, Shuo Yang, Yu-Yang Liu, Jia-Yu Yao, Zhen-Hui Liu, Yu Wang, Ming Pang, Li Yuan

Existing large language models (LLMs) evaluation methods typically focus on testing the performance on some closed-environment and domain-specific benchmarks with human annotations.

PICO

SynSP: Synergy of Smoothness and Precision in Pose Sequences Refinement

1 code implementation • CVPR 2024 • Tao Wang, Lei Jin, Zheng Wang, Jianshu Li, Liang Li, Fang Zhao, Yu Cheng, Li Yuan, Li Zhou, Junliang Xing, Jian Zhao

To leverage this quality information we propose a motion refinement network termed SynSP to achieve a Synergy of Smoothness and Precision in the sequence refinement tasks.

Machine Mindset: An MBTI Exploration of Large Language Models

1 code implementation • 20 Dec 2023 • Jiaxi Cui, Liuzhenghao Lv, Jing Wen, Rongsheng Wang, Jing Tang, Yonghong Tian, Li Yuan

We present a novel approach for integrating Myers-Briggs Type Indicator (MBTI) personality traits into large language models (LLMs), addressing the challenges of personality consistency in personalized AI.

Large Language Model • Personality Alignment • +2

Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable 2D Repainting

1 code implementation • 20 Dec 2023 • Junwu Zhang, Zhenyu Tang, Yatian Pang, Xinhua Cheng, Peng Jin, Yida Wei, Munan Ning, Li Yuan

The core idea is to combine the powerful image generation capability of the 2D diffusion model and the texture alignment ability of the repainting strategy for generating high-quality multi-view images with consistency.

3D Generation • Image to 3D

FreestyleRet: Retrieving Images from Style-Diversified Queries

1 code implementation • 5 Dec 2023 • Hao Li, Curise Jia, Peng Jin, Zesen Cheng, Kehan Li, Jialu Sui, Chang Liu, Li Yuan

In this paper, we propose the Style-Diversified Query-Based Image Retrieval task, which enables retrieval based on various query styles.

Image Retrieval • Retrieval

Regressor-Segmenter Mutual Prompt Learning for Crowd Counting

no code implementations • CVPR 2024 • Mingyue Guo, Li Yuan, Zhaoyi Yan, Binghui Chen, YaoWei Wang, Qixiang Ye

In this study, we propose mutual prompt learning (mPrompt), which leverages a regressor and a segmenter as guidance for each other, solving bias and inaccuracy caused by annotation variance while distinguishing foreground from background.

Crowd Counting • Prompt Learning

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

1 code implementation • 27 Nov 2023 • Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan

Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries.

Decision Making • Question Answering

Advancing Vision Transformers with Group-Mix Attention

2 code implementations • 26 Nov 2023 • Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing Song, Ping Luo

The attention map is computed based on the mixtures of tokens and group proxies and used to re-combine the tokens and groups in Value.

Image Classification • object-detection • +2

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

6 code implementations • 16 Nov 2023 • Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan

In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.

Language Modeling • Language Modelling • +5

Progressive3D: Progressively Local Editing for Text-to-3D Content Creation with Complex Semantic Prompts

no code implementations • 18 Oct 2023 • Xinhua Cheng, Tianyu Yang, Jianan Wang, Yu Li, Lei Zhang, Jian Zhang, Li Yuan

Recent text-to-3D generation methods achieve impressive 3D content creation capacity thanks to the advances in image diffusion models and optimizing strategies.

3D Generation • Text to 3D

IDRNet: Intervention-Driven Relation Network for Semantic Segmentation

1 code implementation • NeurIPS 2023 • Zhenchao Jin, Xiaowei Hu, Lingting Zhu, Luchuan Song, Li Yuan, Lequan Yu

Next, a deletion diagnostics procedure is conducted to model relations of these semantic-level representations via perceiving the network outputs and the extracted relations are utilized to guide the semantic-level representations to interact with each other.

Relation • Relation Network • +1

HiFi-123: Towards High-fidelity One Image to 3D Content Generation

no code implementations • 10 Oct 2023 • Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, WenBo Hu, Long Quan, Ying Shan, Yonghong Tian

Our contributions are twofold: First, we propose a Reference-Guided Novel View Enhancement (RGNV) technique that significantly improves the fidelity of diffusion-based zero-shot novel view synthesis methods.

3D Generation • Image to 3D • +1

LLM Lies: Hallucinations are not Bugs, but Features as Adversarial Examples

1 code implementation • 2 Oct 2023 • Jia-Yu Yao, Kun-Peng Ning, Zhen-Hui Liu, Mu-Nan Ning, Yu-Yang Liu, Li Yuan

This phenomenon forces us to revisit that \emph{hallucination may be another view of adversarial examples}, and it shares similar characteristics with conventional adversarial examples as a basic property of LLMs.

Hallucination

Adversarial Attacks on Video Object Segmentation with Hard Region Discovery

no code implementations • 25 Sep 2023 • Ping Li, Yu Zhang, Li Yuan, Jian Zhao, Xianghua Xu, Xiaoqin Zhang

Particularly, the gradients from the segmentation model are exploited to discover the easily confused region, in which it is difficult to identify the pixel-wise objects from the background in a frame.

Autonomous Driving • Object • +5

Triple-View Knowledge Distillation for Semi-Supervised Semantic Segmentation

no code implementations • 22 Sep 2023 • Ping Li, Junjie Chen, Li Yuan, Xianghua Xu, Mingli Song

To alleviate expensive human labeling, semi-supervised semantic segmentation employs a few labeled images and an abundance of unlabeled images to predict a pixel-level label map of the same size.

Decoder • Feature Importance • +2

Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation

no code implementations • 21 Sep 2023 • Ping Li, Yu Zhang, Li Yuan, Huaxin Xiao, Binbin Lin, Xianghua Xu

Unsupervised Video Object Segmentation (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge.

Semantic Segmentation • Unsupervised Video Object Segmentation • +1

Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation

no code implementations • 21 Sep 2023 • Ping Li, Yu Zhang, Li Yuan, Xianghua Xu

Referring Video Object Segmentation (RVOS) requires segmenting the object in video referred by a natural language query.

Object • Referring Video Object Segmentation • +4

Learning Sparse Neural Networks with Identity Layers

no code implementations • 14 Jul 2023 • Mingjian Ni, Guangyao Chen, Xiawu Zheng, Peixi Peng, Li Yuan, Yonghong Tian

Applying such theory, we propose a plug-and-play CKA-based Sparsity Regularization for sparse network training, dubbed CKA-SR, which utilizes CKA to reduce feature similarity between layers and increase network sparsity.

Spike-driven Transformer

1 code implementation • NeurIPS 2023 • Man Yao, Jiakui Hu, Zhaokun Zhou, Li Yuan, Yonghong Tian, Bo Xu, Guoqi Li

In this paper, we incorporate the spike-driven paradigm into Transformer by the proposed Spike-driven Transformer with four unique properties: 1) Event-driven, no calculation is triggered when the input of Transformer is zero; 2) Binary spike communication, all matrix multiplications associated with the spike matrix can be transformed into sparse additions; 3) Self-attention with linear complexity at both token and channel dimensions; 4) The operations between spike-form Query, Key, and Value are mask and addition.
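Property 2 above, that binary spike communication turns the matrix multiplications involving the spike matrix into sparse additions, can be checked numerically: with a 0/1 spike matrix, each output row is just the sum of the weight rows selected by the active inputs. A toy sketch (illustrative only, not the paper's implementation; shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
spikes = (rng.random((5, 8)) < 0.3).astype(np.float32)  # binary spike matrix (0/1)
W = rng.normal(size=(8, 4)).astype(np.float32)          # dense weight matrix

# Ordinary dense matmul.
dense = spikes @ W

# Spike-driven view: for each neuron, add up only the weight rows
# whose input spiked; no multiplications are needed.
additive = np.stack([W[row.astype(bool)].sum(axis=0) for row in spikes])

assert np.allclose(dense, additive, atol=1e-5)
```

The sparser the spike matrix, the fewer additions are performed, which is the source of the energy savings claimed for spike-driven hardware.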

Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge Graph Enhanced Mixture-of-Experts Large Language Model

1 code implementation • 28 Jun 2023 • Jiaxi Cui, Munan Ning, Zongjian Li, Bohua Chen, Yang Yan, Hao Li, Bin Ling, Yonghong Tian, Li Yuan

AI legal assistants based on Large Language Models (LLMs) can provide accessible legal consulting services, but the hallucination problem poses potential legal risks.

Hallucination • Knowledge Graphs • +4

Auto-Spikformer: Spikformer Architecture Search

no code implementations • 1 Jun 2023 • Kaiwei Che, Zhaokun Zhou, Zhengyu Ma, Wei Fang, Yanqi Chen, Shuaijie Shen, Li Yuan, Yonghong Tian

The integration of self-attention mechanisms into Spiking Neural Networks (SNNs) has garnered considerable interest in the realm of advanced deep learning, primarily due to their biological properties.

ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation

no code implementations • 24 May 2023 • Dongxu Yue, Qin Guo, Munan Ning, Jiaxi Cui, Yuesheng Zhu, Li Yuan

Despite the successful image reconstruction achieved by diffusion-based methods, there are still challenges in effectively manipulating fine-grained facial attributes with textual instructions. To address these issues and facilitate convenient manipulation of real facial images, we propose a novel approach that conducts text-driven image editing in the semantic latent space of a diffusion model.

Attribute • Image Reconstruction

Temporal Contrastive Learning for Spiking Neural Networks

no code implementations • 23 May 2023 • Haonan Qiu, Zeyin Song, Yanqi Chen, Munan Ning, Wei Fang, Tao Sun, Zhengyu Ma, Li Yuan, Yonghong Tian

However, in this work, we find the method above is not ideal for the SNNs training as it omits the temporal dynamics of SNNs and degrades the performance quickly with the decrease of inference time steps.

Contrastive Learning

Album Storytelling with Iterative Story-aware Captioning and Large Language Models

no code implementations • 22 May 2023 • Munan Ning, Yujia Xie, Dongdong Chen, Zeyin Song, Lu Yuan, Yonghong Tian, Qixiang Ye, Li Yuan

One natural approach is to use caption models to describe each photo in the album, and then use LLMs to summarize and rewrite the generated captions into an engaging story.

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment

4 code implementations • 20 May 2023 • Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen

In this paper, we propose the Disentangled Conceptualization and Set-to-set Alignment (DiCoSA) to simulate the conceptualizing and reasoning process of human beings.

Retrieval • Video Retrieval

PointGPT: Auto-regressively Generative Pre-training from Point Clouds

1 code implementation • NeurIPS 2023 • Guangyan Chen, Meiling Wang, Yi Yang, Kai Yu, Li Yuan, Yufeng Yue

Large language models (LLMs) based on the generative pre-training transformer (GPT) have demonstrated remarkable effectiveness across a diverse range of downstream tasks.

Decoder • Few-Shot 3D Point Cloud Classification • +1

Img2Vec: A Teacher of High Token-Diversity Helps Masked AutoEncoders

no code implementations • 25 Apr 2023 • Heng Pan, Chenyang Liu, Wenxiao Wang, Li Yuan, Hongfa Wang, Zhifeng Li, Wei Liu

To study which type of deep features is appropriate for MIM as a learning target, we propose a simple MIM framework with a series of well-trained self-supervised models to convert an Image to a feature Vector as the learning target of MIM, where the feature extractor is also known as a teacher model.

Attribute • Diversity • +1

Learning with Fantasy: Semantic-Aware Virtual Contrastive Constraint for Few-Shot Class-Incremental Learning

1 code implementation • CVPR 2023 • Zeyin Song, Yifan Zhao, Yujun Shi, Peixi Peng, Li Yuan, Yonghong Tian

However, in this work, we find that the CE loss is not ideal for the base session training as it suffers poor class separation in terms of representations, which further degrades generalization to novel classes.

class-incremental learning Contrastive Learning +2

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

4 code implementations CVPR 2023 Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance; they pursue semantic interaction upon pre-defined video-text pairs.

Contrastive Learning Question Answering +5

Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation

no code implementations ICCV 2023 Kehan Li, Yian Zhao, Zhennan Wang, Zesen Cheng, Peng Jin, Xiangyang Ji, Li Yuan, Chang Liu, Jie Chen

Interactive segmentation enables users to segment as needed by providing cues of objects, which introduces human-computer interaction for many fields, such as image editing and medical image analysis.

Interactive Segmentation Medical Image Analysis

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

4 code implementations ICCV 2023 Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, Jie Chen

Existing text-video retrieval solutions are, in essence, discriminant models focused on maximizing the conditional likelihood, i.e., p(candidates|query).

Retrieval Video Retrieval

Parallel Vertex Diffusion for Unified Visual Grounding

no code implementations13 Mar 2023 Zesen Cheng, Kehan Li, Peng Jin, Xiangyang Ji, Li Yuan, Chang Liu, Jie Chen

An intuitive materialization of our paradigm is Parallel Vertex Diffusion (PVD) to directly set vertex coordinates as the generation target and use a diffusion model to train and infer.

Visual Grounding

Rethinking Point Cloud Registration as Masking and Reconstruction

1 code implementation ICCV 2023 Guangyan Chen, Meiling Wang, Li Yuan, Yi Yang, Yufeng Yue

In this paper, a critical observation is made that the invisible parts of each point cloud can be directly utilized as inherent masks, and the aligned point cloud pair can be regarded as the reconstruction target.

Point Cloud Registration

Joint Multimodal Entity-Relation Extraction Based on Edge-enhanced Graph Alignment Network and Word-pair Relation Tagging

1 code implementation28 Nov 2022 Li Yuan, Yi Cai, Jin Wang, Qing Li

This paper is the first to propose jointly performing MNER and MRE as a joint multimodal entity-relation extraction task (JMERE).

graph construction named-entity-recognition +5

ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation

no code implementations CVPR 2023 Kehan Li, Zhennan Wang, Zesen Cheng, Runyi Yu, Yian Zhao, Guoli Song, Chang Liu, Li Yuan, Jie Chen

Recently, self-supervised large-scale visual pre-training models have shown great promise in representing pixel-level semantic relationships, significantly promoting the development of unsupervised dense prediction tasks, e.g., unsupervised semantic segmentation (USS).

Image Segmentation Unsupervised Semantic Segmentation

Spikformer: When Spiking Neural Network Meets Transformer

2 code implementations29 Sep 2022 Zhaokun Zhou, Yuesheng Zhu, Chao He, YaoWei Wang, Shuicheng Yan, Yonghong Tian, Li Yuan

Spikformer (66.3M parameters), with a size comparable to SEW-ResNet-152 (60.2M, 69.26%), achieves 74.81% top-1 accuracy on ImageNet using 4 time steps, which is the state-of-the-art among directly trained SNN models.

Image Classification

Locality Guidance for Improving Vision Transformers on Tiny Datasets

1 code implementation20 Jul 2022 Kehan Li, Runyi Yu, Zhennan Wang, Li Yuan, Guoli Song, Jie Chen

Therefore, our locality guidance approach is very simple and efficient, and can serve as a basic performance enhancement method for VTs on tiny datasets.

Improving Vision Transformers by Revisiting High-frequency Components

1 code implementation3 Apr 2022 Jiawang Bai, Li Yuan, Shu-Tao Xia, Shuicheng Yan, Zhifeng Li, Wei Liu

Inspired by this finding, we first investigate the effects of existing techniques for improving ViT models from a new frequency perspective, and find that the success of some techniques (e.g., RandAugment) can be attributed to the better usage of the high-frequency components.

Domain Generalization Image Classification +1

Masked Autoencoders for Point Cloud Self-supervised Learning

4 code implementations13 Mar 2022 Yatian Pang, Wenxiao Wang, Francis E. H. Tay, Wei Liu, Yonghong Tian, Li Yuan

Then, a standard Transformer based autoencoder, with an asymmetric design and a shifting mask tokens operation, learns high-level latent features from unmasked point patches, aiming to reconstruct the masked point patches.

3D Part Segmentation Few-Shot 3D Point Cloud Classification +2
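The masking step described above can be sketched as a random split of patch indices into masked and visible sets. A minimal numpy illustration (the 60% ratio and fixed seed are assumptions for the example, not necessarily the paper's setting):

```python
import numpy as np

def random_mask(num_patches, ratio=0.6, rng=None):
    """Split patch indices into (masked, visible) for MAE-style pretraining."""
    if rng is None:
        rng = np.random.default_rng(0)
    num_mask = int(num_patches * ratio)
    perm = rng.permutation(num_patches)
    return perm[:num_mask], perm[num_mask:]  # masked, visible

masked, visible = random_mask(64, ratio=0.6)
print(len(masked), len(visible))  # -> 38 26
```

Only the visible patches are fed to the encoder; the decoder then reconstructs the patches at the masked indices.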

DynaMixer: A Vision MLP Architecture with Dynamic Mixing

2 code implementations28 Jan 2022 Ziyu Wang, Wenhao Jiang, Yiming Zhu, Li Yuan, Yibing Song, Wei Liu

In contrast with vision transformers and CNNs, the success of MLP-like models shows that simple information fusion operations among tokens and channels can yield a good representation power for deep recognition models.

Image Classification

Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction

1 code implementation17 Dec 2021 Guangyan Chen, Meiling Wang, Yufeng Yue, Qingxiang Zhang, Li Yuan

Recent Transformer-based methods have achieved advanced performance in point cloud registration by exploiting the Transformer's advantages in order invariance and dependency modeling to aggregate information.

Geometric Matching Point Cloud Registration

PnP-DETR: Towards Efficient Visual Analysis with Transformers

1 code implementation ICCV 2021 Tao Wang, Li Yuan, Yunpeng Chen, Jiashi Feng, Shuicheng Yan

Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.

object-detection Object Detection +1

VOLO: Vision Outlooker for Visual Recognition

7 code implementations24 Jun 2021 Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan

Though recently the prevailing vision transformers (ViTs) have shown great potential of self-attention based models in ImageNet classification, their performance is still inferior to that of the latest SOTA CNNs if no extra data are provided.

Domain Generalization Image Classification +1

Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition

3 code implementations23 Jun 2021 Qibin Hou, Zihang Jiang, Li Yuan, Ming-Ming Cheng, Shuicheng Yan, Jiashi Feng

By realizing the importance of the positional information carried by 2D feature representations, unlike recent MLP-like models that encode the spatial information along the flattened spatial dimensions, Vision Permutator separately encodes the feature representations along the height and width dimensions with linear projections.

Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization

1 code implementation31 Mar 2021 Zeke Xie, Li Yuan, Zhanxing Zhu, Masashi Sugiyama

It is well-known that stochastic gradient noise (SGN) acts as implicit regularization for deep learning and is essential for both the optimization and the generalization of deep networks.

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

13 code implementations ICCV 2021 Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan

To overcome such limitations, we propose a new Tokens-To-Token Vision Transformer (T2T-ViT), which incorporates 1) a layer-wise Tokens-to-Token (T2T) transformation that progressively structurizes the image into tokens by recursively aggregating neighboring tokens into one token (Tokens-to-Token), so that local structure represented by surrounding tokens can be modeled and the token length can be reduced; 2) an efficient backbone with a deep-narrow structure for the vision transformer, motivated by CNN architecture design after an empirical study.

Image Classification Language Modeling +1
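The Tokens-to-Token aggregation described above can be sketched as a sliding-window merge of neighboring tokens. A minimal numpy illustration (the paper's soft split is an unfold followed by learned projections; the 3x3 kernel, stride 2, and padding here are illustrative assumptions):

```python
import numpy as np

def soft_split(tokens, h, w, k=3, stride=2, pad=1):
    """Aggregate each k x k neighborhood of tokens into one longer token.

    tokens: (h*w, c) array; returns ((new_h*new_w, c*k*k), new_h, new_w)."""
    c = tokens.shape[1]
    grid = tokens.reshape(h, w, c)
    grid = np.pad(grid, ((pad, pad), (pad, pad), (0, 0)))
    new_h = (h + 2 * pad - k) // stride + 1
    new_w = (w + 2 * pad - k) // stride + 1
    out = np.empty((new_h, new_w, c * k * k))
    for i in range(new_h):
        for j in range(new_w):
            patch = grid[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = patch.reshape(-1)  # concatenate the k*k neighbors
    return out.reshape(new_h * new_w, c * k * k), new_h, new_w

# 14x14 tokens of dim 64 merge into 7x7 tokens of dim 576: length shrinks 4x
tokens = np.random.randn(14 * 14, 64)
merged, nh, nw = soft_split(tokens, 14, 14)
print(merged.shape)  # -> (49, 576)
```

Each application shortens the token sequence while widening each token with its neighborhood, which is the "progressive structurization" the abstract refers to.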

Fooling the primate brain with minimal, targeted image manipulation

no code implementations11 Nov 2020 Li Yuan, Will Xiao, Giorgia Dellaferrera, Gabriel Kreiman, Francis E. H. Tay, Jiashi Feng, Margaret S. Livingstone

Here we propose an array of methods for creating minimal, targeted image perturbations that lead to changes in both neuronal activity and perception as reflected in behavior.

Adversarial Attack Image Manipulation

Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes

no code implementations16 Oct 2020 Li Yuan, Yichen Zhou, Shuning Chang, Ziyuan Huang, Yunpeng Chen, Xuecheng Nie, Tao Wang, Jiashi Feng, Shuicheng Yan

Prior works fail to deal with this problem in two respects: (1) they do not exploit information about the scenes; (2) they lack training data for crowded and complex scenes.

Action Recognition In Videos Diversity +1

A Simple Baseline for Pose Tracking in Videos of Crowded Scenes

no code implementations16 Oct 2020 Li Yuan, Shuning Chang, Ziyuan Huang, Yichen Zhou, Yunpeng Chen, Xuecheng Nie, Francis E. H. Tay, Jiashi Feng, Shuicheng Yan

This paper presents our solution to the ACM MM challenge: Large-scale Human-centric Video Analysis in Complex Events \cite{lin2020human}; specifically, we focus on Track 3: Crowd Pose Tracking in Complex Events.

Multi-Object Tracking Optical Flow Estimation +1

Towards Accurate Human Pose Estimation in Videos of Crowded Scenes

no code implementations16 Oct 2020 Li Yuan, Shuning Chang, Xuecheng Nie, Ziyuan Huang, Yichen Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan

In this paper, we focus on improving human pose estimation in videos of crowded scenes from the perspectives of exploiting temporal context and collecting new data.

Diversity Optical Flow Estimation +1

Exploring global diverse attention via pairwise temporal relation for video summarization

no code implementations23 Sep 2020 Ping Li, Qinghao Ye, Luming Zhang, Li Yuan, Xianghua Xu, Ling Shao

In this paper, we propose an efficient convolutional neural network architecture for video SUMmarization via Global Diverse Attention, called SUM-GDA, which adapts the attention mechanism from a global perspective to consider pairwise temporal relations of video frames.

Decoder Relation +1

YNU-HPCC at SemEval-2020 Task 8: Using a Parallel-Channel Model for Memotion Analysis

1 code implementation SEMEVAL 2020 Li Yuan, Jin Wang, Xue-jie Zhang

In recent years, the growing ubiquity of Internet memes on social media platforms, such as Facebook, Instagram, and Twitter, has become a topic of immense interest.

Emotion Recognition Sentiment Analysis +2

Revisiting Knowledge Distillation via Label Smoothing Regularization

2 code implementations CVPR 2020 Li Yuan, Francis E. H. Tay, Guilin Li, Tao Wang, Jiashi Feng

Without any extra computation cost, Tf-KD achieves up to 0.65% improvement on ImageNet over well-established baseline models, which is superior to label smoothing regularization.

Self-Knowledge Distillation
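The label smoothing regularization baseline that Tf-KD is compared against can be written in a few lines: mix the one-hot target with a uniform distribution, then train with soft-target cross-entropy. A minimal numpy sketch (the smoothing factor eps=0.1 is an assumed value for illustration):

```python
import numpy as np

def smoothed_targets(labels, num_classes, eps=0.1):
    """Label smoothing: (1-eps) on the true class, eps spread uniformly."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

def cross_entropy(logits, targets):
    """Soft-target cross-entropy, averaged over the batch."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()

labels = np.array([0, 2])
t = smoothed_targets(labels, num_classes=3, eps=0.1)
print(t[0])  # -> [0.93333333 0.03333333 0.03333333]
```

Tf-KD's observation is that a manually designed soft distribution like this can play the role of a "virtual teacher", with no extra forward passes.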

Central Similarity Quantization for Efficient Image and Video Retrieval

1 code implementation CVPR 2020 Li Yuan, Tao Wang, Xiaopeng Zhang, Francis EH Tay, Zequn Jie, Wei Liu, Jiashi Feng

In this work, we propose a new global similarity metric, termed central similarity, with which the hash codes of similar data pairs are encouraged to approach a common center and those of dissimilar pairs to converge to different centers, so as to improve hash learning efficiency and retrieval accuracy.

Quantization Retrieval +2
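One standard way to obtain the well-separated hash centers described above is to take rows of a Hadamard matrix, whose rows are mutually orthogonal and therefore pairwise far apart in Hamming distance. A minimal numpy sketch (illustrative construction, not the paper's implementation; 8-bit codes and 4 classes are assumed values):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def hash_centers(num_classes, bits):
    """Binary hash centers with pairwise Hamming distance bits/2."""
    h = hadamard(bits)
    return (h[:num_classes] > 0).astype(int)

centers = hash_centers(num_classes=4, bits=8)
d = (centers[0] != centers[1]).sum()
print(d)  # -> 4  (i.e., bits/2)
```

Each class is then assigned one center, and the hash network is trained to pull every sample's code toward its class center.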

Distilling Object Detectors with Fine-grained Feature Imitation

3 code implementations CVPR 2019 Tao Wang, Li Yuan, Xiaopeng Zhang, Jiashi Feng

To address the challenge of distilling knowledge in detection models, we propose a fine-grained feature imitation method exploiting the cross-location discrepancy of feature response.

Knowledge Distillation Object +2

Cycle-SUM: Cycle-consistent Adversarial LSTM Networks for Unsupervised Video Summarization

no code implementations17 Apr 2019 Li Yuan, Francis EH Tay, Ping Li, Li Zhou, Jiashi Feng

The evaluator defines a learnable information preserving metric between original video and summary video and "supervises" the selector to identify the most informative frames to form the summary video.

Unsupervised Video Summarization

Few-shot Adaptive Faster R-CNN

no code implementations CVPR 2019 Tao Wang, Xiaopeng Zhang, Li Yuan, Jiashi Feng

To address these challenges, we first introduce a pairing mechanism over source and target features to alleviate the issue of insufficient target domain samples.

object-detection Object Detection +1

The Unconstrained Ear Recognition Challenge 2019 - ArXiv Version With Appendix

no code implementations11 Mar 2019 Žiga Emeršič, Aruna Kumar S. V., B. S. Harish, Weronika Gutfeter, Jalil Nourmohammadi Khiarak, Andrzej Pacut, Earnest Hansley, Mauricio Pamplona Segundo, Sudeep Sarkar, Hyeonjung Park, Gi Pyo Nam, Ig-Jae Kim, Sagar G. Sangodkar, Ümit Kaçar, Murvet Kirci, Li Yuan, Jishou Yuan, Haonan Zhao, Fei Lu, Junying Mao, Xiaoshuang Zhang, Dogucan Yaman, Fevziye Irem Eyiokur, Kadir Bulut Özler, Hazim Kemal Ekenel, Debbrota Paul Chowdhury, Sambit Bakshi, Pankaj K. Sa, Banshidhar Majhi, Peter Peer, Vitomir Štruc

The goal of the challenge is to assess the performance of existing ear recognition techniques on a challenging large-scale ear dataset and to analyze the performance of the technology from various viewpoints, such as generalization ability to unseen data characteristics, sensitivity to rotations, occlusions, and image resolution, and performance bias on sub-groups of subjects selected based on demographic criteria, i.e., gender and ethnicity.

Benchmarking Person Recognition

Object Relation Detection Based on One-shot Learning

no code implementations16 Jul 2018 Li Zhou, Jian Zhao, Jianshu Li, Li Yuan, Jiashi Feng

Detecting the relations among objects, such as "cat on sofa" and "person ride horse", is a crucial task in image understanding, and beneficial to bridging the semantic gap between images and natural language.

Object One-Shot Learning +1
