3 code implementations • 20 May 2025 • Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo
Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity.
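The granularity effect can be illustrated with a minimal round-to-nearest sketch (not the paper's code): quantizing the same weight matrix per-tensor versus per-group, where the 4-bit setting and the group size of 128 are illustrative assumptions.

```python
import torch

def quantize_dequantize(w, n_bits=4, group_size=None):
    """Symmetric round-to-nearest quantization, per-tensor or per-group."""
    shape = w.shape
    if group_size is None:
        group_size = w.numel()                  # per-tensor: one scale for everything
    w = w.reshape(-1, group_size)               # finer granularity: one scale per group
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w_q.reshape(shape)

w = torch.randn(4096, 4096)
for g in (None, 128):                           # per-tensor vs. per-group (size 128)
    err = (quantize_dequantize(w, 4, g) - w).pow(2).mean().item()
    print(f"group={g}: MSE={err:.6f}")          # coarser granularity -> larger error
```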
no code implementations • 17 May 2025 • Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, Lingjun Liu, Bole Ma, Xiaoying Jia, Zhou Xun, Siyuan Qiao, Liang Xiang, Yonghui Wu
Model merging has emerged as a promising technique for enhancing large language models, though its application in large-scale pre-training remains relatively unexplored.
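As a point of reference only, the simplest form of model merging is a weighted average of checkpoint parameters; the sketch below assumes plain PyTorch state dicts with placeholder paths and equal weights, and is not the paper's specific recipe.

```python
import torch

def merge_checkpoints(state_dicts, weights=None):
    """Weighted average of model checkpoints (a simple form of model merging)."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# e.g. average three pre-training checkpoints (paths are placeholders)
ckpts = [torch.load(p, map_location="cpu") for p in ("ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt")]
merged = merge_checkpoints(ckpts)
```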
1 code implementation • 12 May 2025 • Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo
This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward).
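The core GRPO ingredient is the group-relative advantage: each sample's reward is normalized against the other samples drawn for the same prompt. A minimal sketch, with illustrative reward values and a hypothetical `group_relative_advantages` helper:

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against the other samples generated
    from the same prompt (one row = one group)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 2 prompts, 4 samples each, scored by a reward model (illustrative numbers)
rewards = torch.tensor([[0.2, 0.5, 0.9, 0.4],
                        [0.7, 0.1, 0.3, 0.6]])
adv = group_relative_advantages(rewards)   # used to weight the policy-gradient update
```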
1 code implementation • 11 Feb 2025 • Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You
DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored.
no code implementations • 22 Jan 2025 • Jiahao Wang, Ning Kang, Lewei Yao, Mengzhao Chen, Chengyue Wu, Songyang Zhang, Shuchen Xue, Yong liu, Taiqiang Wu, Xihui Liu, Kaipeng Zhang, Shifeng Zhang, Wenqi Shao, Zhenguo Li, Ping Luo
(3) Hybrid knowledge distillation objective: using a pre-trained diffusion Transformer to guide the training of the student linear Transformer, supervising not only the predicted noise but also the variance of the reverse diffusion process.
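A hedged sketch of what such a hybrid objective could look like, assuming the student and teacher both expose a predicted noise and a predicted log-variance, with a hypothetical weighting factor `lam`:

```python
import torch.nn.functional as F

def hybrid_distill_loss(student_noise, student_logvar, teacher_noise, teacher_logvar,
                        lam=1.0):
    """Distillation objective sketch: match the teacher's predicted noise and the
    (log-)variance of the reverse diffusion step. lam balances the two terms."""
    noise_term = F.mse_loss(student_noise, teacher_noise)
    var_term = F.mse_loss(student_logvar, teacher_logvar)
    return noise_term + lam * var_term
```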
1 code implementation • 7 Oct 2024 • Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo
In this work, we propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels (W4A4KV4 and W4A8KV4) and granularities (dynamic and static quantization) by effectively isolating token-wise outliers.
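A rough sketch of the underlying idea, assuming per-tensor static activation quantization and a hypothetical `outlier_mask` marking the tokens that PrefixQuant would instead prefix in the KV cache:

```python
import torch

def static_activation_scale(acts, outlier_mask, n_bits=4):
    """Sketch: derive one static (offline) per-tensor scale for activations after
    excluding positions flagged as outlier tokens; isolating those tokens leaves
    the remaining activations much easier to quantize."""
    qmax = 2 ** (n_bits - 1) - 1
    return acts[~outlier_mask].abs().max().clamp(min=1e-8) / qmax

# acts: (seq_len, hidden) calibration activations; outlier_mask flags outlier tokens
acts = torch.randn(512, 4096)
outlier_mask = torch.zeros(512, dtype=torch.bool)
outlier_mask[0] = True          # e.g. an initial token carrying extreme activations
scale = static_activation_scale(acts, outlier_mask)
```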
1 code implementation • 10 Jul 2024 • Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Ping Luo
To the best of our knowledge, Block-AP is the first method to enable direct training of all parameters in a block-wise manner, reducing accuracy loss in low-bit scenarios by enhancing the solution space during optimization.
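A minimal sketch of block-wise training, assuming `block` is the quantization-aware block, `block_fp` its frozen full-precision counterpart, and `calib_inputs` a small calibration set; the hyperparameters are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def train_block(block, block_fp, calib_inputs, steps=100, lr=1e-4):
    """Block-wise QAT sketch: train *all* parameters of one quantized transformer
    block to match the full-precision block's outputs on calibration inputs."""
    opt = torch.optim.AdamW(block.parameters(), lr=lr)
    for _ in range(steps):
        for x in calib_inputs:                     # x: (batch, seq, hidden)
            with torch.no_grad():
                target = block_fp(x)
            loss = F.mse_loss(block(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return block
```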
1 code implementation • 10 Apr 2024 • Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong liu, Taiqiang Wu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo
We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue, resulting in the failure to the network training.
2 code implementations • 18 Feb 2024 • Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo
Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and question answering.
1 code implementation • 16 Nov 2023 • Yunshan Zhong, Jiawei Hu, Mengzhao Chen, Rongrong Ji
Despite the scalable performance of vision transformers (ViTs), their dense computational costs (for both training and inference) undermine their position in industrial applications.
2 code implementations • 25 Aug 2023 • Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo
LWC modulates the extreme values of weights by optimizing the clipping threshold.
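A sketch of learnable weight clipping in its usual formulation, where sigmoid-parameterized factors `gamma` and `beta` shrink the per-channel max/min before asymmetric quantization; shapes and the 4-bit setting are illustrative assumptions.

```python
import torch

def lwc_quantize(w, gamma, beta, n_bits=4):
    """Learnable weight clipping sketch: learned factors shrink the per-channel
    min/max before computing asymmetric quantization parameters."""
    wmax = torch.sigmoid(gamma) * w.amax(dim=-1, keepdim=True)
    wmin = torch.sigmoid(beta) * w.amin(dim=-1, keepdim=True)
    scale = (wmax - wmin).clamp(min=1e-8) / (2 ** n_bits - 1)
    zero = torch.round(-wmin / scale)
    w_q = torch.clamp(torch.round(w / scale) + zero, 0, 2 ** n_bits - 1)
    return (w_q - zero) * scale            # dequantized weights used in the loss

# gamma/beta are trainable (one per output channel here), optimized block-wise
out_features = 4096
gamma = torch.zeros(out_features, 1, requires_grad=True)
beta = torch.zeros(out_features, 1, requires_grad=True)
```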
1 code implementation • ICCV 2023 • Mengzhao Chen, Wenqi Shao, Peng Xu, Mingbao Lin, Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, Ping Luo
Token compression aims to speed up large-scale vision transformers (e.g., ViTs) by pruning (dropping) or merging tokens.
Ranked #4 on Efficient ViTs on ImageNet-1K (with DeiT-S)
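A toy sketch of token compression for the entry above, assuming importance is taken from the [CLS] attention and using an arbitrary `keep_ratio`; the paper's learned compression rates are not reproduced here.

```python
import torch

def compress_tokens(tokens, cls_attn, keep_ratio=0.7):
    """Rank image tokens by their attention from the [CLS] token, keep the
    top ones (pruning), and merge the rest into a single token (merging)."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    idx = cls_attn.argsort(dim=-1, descending=True)                 # (B, N)
    keep = torch.gather(tokens, 1, idx[:, :n_keep, None].expand(-1, -1, d))
    drop = torch.gather(tokens, 1, idx[:, n_keep:, None].expand(-1, -1, d))
    merged = drop.mean(dim=1, keepdim=True)                         # merge dropped tokens
    return torch.cat([keep, merged], dim=1)
```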
1 code implementation • ICCV 2023 • Mengzhao Chen, Mingbao Lin, Zhihang Lin, Yuxin Zhang, Fei Chao, Rongrong Ji
Thanks to the subtle designs of the self-motivated paradigm, our SMMix incurs smaller training overhead and achieves better performance than other CutMix variants.
1 code implementation • 23 May 2022 • Mingbao Lin, Mengzhao Chen, Yuxin Zhang, Chunhua Shen, Rongrong Ji, Liujuan Cao
Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational costs of ViT models while even improving performance.
1 code implementation • 8 Mar 2022 • Mengzhao Chen, Mingbao Lin, Ke Li, Yunhang Shen, Yongjian Wu, Fei Chao, Rongrong Ji
Our proposed CF-ViT is motivated by two important observations in modern ViT models: (1) The coarse-grained patch splitting can locate informative regions of an input image.
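The coarse-to-fine idea can be sketched as follows, assuming the [CLS] attention from a coarse inference pass is available; the patch sizes and the `top_frac` threshold are illustrative choices, not the paper's settings.

```python
import torch

def coarse_to_fine_patches(image, cls_attn, coarse=32, fine=16, top_frac=0.25):
    """Patchify coarsely, use [CLS] attention to pick informative coarse patches,
    and only those get re-split at the fine patch size."""
    # image: (C, H, W); cls_attn: (num_coarse_patches,) from the coarse pass
    c, h, w = image.shape
    patches = image.unfold(1, coarse, coarse).unfold(2, coarse, coarse)   # (C, Hc, Wc, p, p)
    patches = patches.reshape(c, -1, coarse, coarse).permute(1, 0, 2, 3)  # (N, C, p, p)
    k = max(1, int(patches.size(0) * top_frac))
    informative = patches[cls_attn.topk(k).indices]                       # (k, C, p, p)
    # re-split each informative coarse patch into (coarse // fine) ** 2 fine patches
    fine_patches = informative.unfold(2, fine, fine).unfold(3, fine, fine)
    return fine_patches.reshape(k, c, -1, fine, fine)
```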
1 code implementation • 30 Jan 2022 • Yuxin Zhang, Mingbao Lin, Mengzhao Chen, Fei Chao, Rongrong Ji
We prove that supermask training accumulates the criteria of gradient-driven sparsity for both removed and preserved weights, and that it can partly solve the independence paradox.
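For context, a minimal sketch of supermask training in the edge-popup style (frozen weights, learned scores, straight-through top-k mask); this illustrates the setting the claim refers to, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

class SupermaskLinear(torch.nn.Module):
    """Weights stay frozen; a score per weight is learned, a binary mask keeps the
    top-k scores, and gradients pass straight through the mask to the scores."""
    def __init__(self, in_f, out_f, sparsity=0.5):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_f, in_f), requires_grad=False)
        self.scores = torch.nn.Parameter(torch.randn(out_f, in_f))
        self.k = int(self.weight.numel() * (1 - sparsity))

    def forward(self, x):
        flat = self.scores.flatten()
        mask = torch.zeros_like(flat)
        mask[flat.topk(self.k).indices] = 1.0
        mask = mask.view_as(self.scores)
        # straight-through: forward uses the hard mask, backward flows into the scores
        mask = mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask)
```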
1 code implementation • 9 Sep 2021 • Yunshan Zhong, Mingbao Lin, Mengzhao Chen, Ke Li, Yunhang Shen, Fei Chao, Yongjian Wu, Rongrong Ji
While post-training quantization owes its popularity largely to avoiding access to the original complete training dataset, its poor performance also stems from the resulting scarcity of images.