1 code implementation • 30 May 2025 • Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, Yi Dong
Recent advances in reasoning-centric language models have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards.
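To make "verifiable reward" concrete, here is a minimal sketch (our illustration, not the paper's code) in which the reward reduces to a programmatic answer check:

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Binary, programmatically verifiable reward: 1.0 iff the final
    boxed answer matches the reference string exactly (the boxed-answer
    format is an illustrative assumption; the paper's checker may differ)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```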
1 code implementation • 22 Apr 2025 • Zhifan Ye, Kejing Xia, Yonggan Fu, Xin Dong, Jihoon Hong, Xiangchi Yuan, Shizhe Diao, Jan Kautz, Pavlo Molchanov, Yingyan Celine Lin
State space models (SSMs) have emerged as an efficient alternative to Transformer models for language modeling, offering linear computational complexity and constant memory usage as context length increases.
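A minimal sketch of the linear-time, constant-memory recurrence behind SSMs (illustrative only; the diagonal parameterization here is our simplification):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal diagonal state-space recurrence: h_t = A*h_{t-1} + B*x_t,
    y_t = C @ h_t. Time is linear in sequence length, and the hidden
    state h has fixed size regardless of context length."""
    h = np.zeros_like(A, dtype=float)   # constant-size hidden state
    ys = []
    for x_t in x:                        # single pass over the sequence
        h = A * h + B * x_t
        ys.append(float(C @ h))
    return np.array(ys)

y = ssm_scan(x=np.random.randn(1024), A=np.full(16, 0.9),
             B=np.ones(16), C=np.ones(16))
```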
no code implementations • 17 Apr 2025 • Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, Pavlo Molchanov
We analyze the final data mixture, elucidating the characteristics of an optimal data mixture.
1 code implementation • 5 Mar 2025 • Ruida Wang, Rui Pan, Yuxin Li, Jipeng Zhang, Yizhen Jia, Shizhe Diao, Renjie Pi, Junjie Hu, Tong Zhang
To address these issues, we propose MA-LoT (Multi-Agent Lean-based Long Chain-of-Thought), to the best of our knowledge the first multi-agent framework for Lean4 theorem proving that balances high-level NL reasoning with FL verification in Long CoT.
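For readers new to Lean4, FL verification means the proof assistant's kernel machine-checks every proof term, so an incorrect proof cannot pass. A toy theorem of the kind such provers target (our example, not from the paper):

```lean
-- A minimal Lean4 theorem: the kernel verifies that the proof term
-- Nat.add_comm a b really has type a + b = b + a.
theorem add_comm' (a b : Nat) : a + b = b + a := Nat.add_comm a b
```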
no code implementations • 5 Feb 2025 • Boyao Wang, Rui Pan, Shizhe Diao, Xingyuan Pan, Jipeng Zhang, Renjie Pi, Tong Zhang
Small language models (SLMs) have attracted considerable attention from both academia and industry due to their broad range of applications in edge devices.
1 code implementation • 1 Feb 2025 • Xin Xu, Qiyun Xu, Tong Xiao, Tianhao Chen, Yuchen Yan, Jiaxin Zhang, Shizhe Diao, Can Yang, Yang Wang
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks, particularly in mathematics.
1 code implementation • 15 Dec 2024 • Hanning Zhang, Pengcheng Wang, Shizhe Diao, Yong Lin, Rui Pan, Hanze Dong, Dylan Zhang, Pavlo Molchanov, Tong Zhang
Our theoretical analysis shows that the optimal reward model can be derived from samples of the initial policy.
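A plausible reading of this claim is the standard KL-regularized RL identity, under which the optimal policy and the reward determine each other (notation is ours, not necessarily the paper's):

```latex
% KL-regularized RL identity (our notation): the optimal policy under
% reward r with a KL penalty toward the initial policy \pi_0 satisfies
\[
  \pi^{*}(y \mid x) \;\propto\; \pi_{0}(y \mid x)\,
  \exp\!\bigl(r(x, y)/\beta\bigr),
\]
% hence the reward is recoverable, up to a prompt-dependent constant,
% from the two policies' likelihoods:
\[
  r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{0}(y \mid x)} + C(x).
\]
```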
no code implementations • 20 Nov 2024 • Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov
We propose Hymba, a family of small language models featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with state space models (SSMs) for enhanced efficiency.
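A schematic of the hybrid-head idea as we read the abstract (our PyTorch sketch, not the released Hymba code; the toy linear recurrence merely stands in for a real SSM head):

```python
import torch
import torch.nn as nn

class HybridHeadBlock(nn.Module):
    """Sketch of a hybrid-head block: attention heads and an SSM-like
    path read the same input in parallel; their outputs are fused."""
    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ssm_proj = nn.Linear(dim, dim)        # stand-in for an SSM head
        self.decay = nn.Parameter(torch.full((dim,), 0.9))
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, x):                           # x: (batch, seq, dim)
        a, _ = self.attn(x, x, x)                    # attention path
        h = torch.zeros_like(x[:, 0])                # toy recurrent path
        states = []
        for t in range(x.size(1)):
            h = self.decay * h + self.ssm_proj(x[:, t])
            states.append(h)
        s = torch.stack(states, dim=1)
        return self.out(torch.cat([a, s], dim=-1))   # fuse both paths
```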
1 code implementation • 4 Oct 2024 • Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, Lifu Huang
Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding; however, they struggle with fine-grained temporal grounding.
1 code implementation • 25 Aug 2024 • Qiaolong Cai, Zhaowei Wang, Shizhe Diao, James Kwok, Yangqiu Song
Compared with existing methods, CodeGraph demonstrates strong performance on arithmetic problems within graph tasks and offers a more controllable and interpretable approach to the reasoning process.
1 code implementation • 22 Aug 2024 • Kashun Shum, Minrui Xu, Jianshu Zhang, Zixin Chen, Shizhe Diao, Hanze Dong, Jipeng Zhang, Muhammad Omer Raza
We then propose a new method, Efficient Trustworthy Distillation (FIRST), which utilizes a small portion of the teacher's knowledge to obtain a reliable language model in a cost-efficient way.
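The abstract implies transferring only part of the teacher's distribution; a common realization is top-k logit distillation, sketched below under that assumption (whether FIRST does exactly this is not claimed here):

```python
import torch
import torch.nn.functional as F

def topk_distill_loss(student_logits, teacher_logits, k=5, T=2.0):
    """Distill only the teacher's top-k token probabilities: gather the
    student's log-probs at the teacher's top-k indices and match the
    renormalized top-k teacher distribution (our sketch)."""
    topv, topi = teacher_logits.topk(k, dim=-1)
    t_prob = F.softmax(topv / T, dim=-1)                     # renormalized top-k
    s_logp = F.log_softmax(student_logits / T, dim=-1).gather(-1, topi)
    return F.kl_div(s_logp, t_prob, reduction="batchmean") * T * T
```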
no code implementations • 21 Aug 2024 • Sharath Turuvekere Sreenivas, Saurav Muralidharan, Raviraj Joshi, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Gerald Shen, Jiaqi Zeng, Zijia Chen, Yoshi Suhara, Shizhe Diao, Chenhan Yu, Wei-Chun Chen, Hayley Ross, Oluwatobi Olabiyi, Ashwath Aithal, Oleksii Kuchaiev, Daniel Korzekwa, Pavlo Molchanov, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro
We present a comprehensive report on compressing the Llama 3.1 8B and Mistral NeMo 12B models to 4B and 8B parameters, respectively, using pruning and distillation.
1 code implementation • 3 Jul 2024 • Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, Tong Zhang
However, due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data, most modern LLMs exhibit suboptimal performance. This scarcity also leaves few methodologies for training LLMs and few techniques for fully utilizing their capabilities in composing formal proofs.
no code implementations • 12 Jun 2024 • Cheng Niu, Yang Guan, Yuanhao Wu, Juno Zhu, Juntong Song, Randy Zhong, Kaihua Zhu, Siliang Xu, Shizhe Diao, Tong Zhang
In response to this challenge, we introduce VeraCT Scan, a novel retrieval-augmented system for fake news detection.
no code implementations • 11 Jun 2024 • Dylan Zhang, Shizhe Diao, Xueyan Zou, Hao Peng
Recent findings demonstrate that on-policy data is the key to successful preference learning, where the preference data is collected using the same policy LM being trained.
1 code implementation • 31 May 2024 • Tianyang Xu, Shujin Wu, Shizhe Diao, Xiaoze Liu, Xingyao Wang, Yangyi Chen, Jing Gao
Large language models (LLMs) often generate inaccurate or fabricated information and generally fail to indicate their confidence, which limits their broader applications.
1 code implementation • 26 Mar 2024 • Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, Tong Zhang
To remedy this deficiency, we investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unexpected but consistent skewness of weight norms across different layers.
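A minimal script for the kind of layerwise measurement described, computing how far each layer's weights move during fine-tuning (our sketch, not the paper's code):

```python
import torch

def layerwise_update_norms(model_before, model_after):
    """Return the L2 norm of each parameter's update between two model
    checkpoints; skewness of these norms across layers is the kind of
    observation the abstract reports."""
    norms = {}
    for (name, p0), (_, p1) in zip(model_before.named_parameters(),
                                   model_after.named_parameters()):
        norms[name] = (p1.detach() - p0.detach()).norm().item()
    return norms
```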
1 code implementation • 28 Feb 2024 • Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, Tong Zhang
Additionally, DPA models user preferences as directions (i.e., unit vectors) in the reward space to achieve user-dependent preference control.
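A minimal sketch of directional preference scalarization as we read it: a multi-attribute reward vector is projected onto a user-chosen unit direction (the two-attribute example is hypothetical):

```python
import numpy as np

def directional_reward(reward_vec: np.ndarray, user_dir: np.ndarray) -> float:
    """Project a multi-attribute reward vector onto a user-chosen unit
    direction in reward space; different directions yield different
    user-dependent scalar rewards."""
    v = user_dir / np.linalg.norm(user_dir)   # normalize to a unit vector
    return float(reward_vec @ v)

# e.g., trading off two attributes with a 60/40 preference direction
r = directional_reward(np.array([0.8, 0.3]), np.array([0.6, 0.4]))
```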
1 code implementation • 16 Feb 2024 • Xin Xu, Shizhe Diao, Can Yang, Yang Wang
Chain-of-Thought (CoT) prompting has marked a significant advancement in enhancing the reasoning capabilities of large language models (LLMs).
1 code implementation • 6 Feb 2024 • Tianyang Han, Qing Lian, Rui Pan, Renjie Pi, Jipeng Zhang, Shizhe Diao, Yong Lin, Tong Zhang
In this paper, we identify a typical class of inputs that baffles MLLMs, which consist of images that are highly relevant but inconsistent with answers, causing MLLMs to suffer from visual illusion.
1 code implementation • 25 Jan 2024 • Quyet V. Do, Tianqing Fang, Shizhe Diao, Zhaowei Wang, Yangqiu Song
When considering a new knowledge instance, ConstraintChecker employs a rule-based module to produce a list of constraints, and then uses a zero-shot learning module to check whether the knowledge instance satisfies all of them.
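A schematic of that two-stage pipeline (our sketch; `zero_shot_check` is a hypothetical callable standing in for the zero-shot module, e.g. an LLM prompt):

```python
def constraint_checker(knowledge, rules, zero_shot_check):
    """Two-stage check as described in the abstract: rule-based modules
    derive constraints from the knowledge instance, then a zero-shot
    module verifies each constraint; accept only if all pass."""
    constraints = [rule(knowledge) for rule in rules]
    return all(zero_shot_check(knowledge, c) for c in constraints)
```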
1 code implementation • 16 Nov 2023 • Hanning Zhang, Shizhe Diao, Yong Lin, Yi R. Fung, Qing Lian, Xingyao Wang, Yangyi Chen, Heng Ji, Tong Zhang
This approach is formalized by first identifying the disparity between the knowledge encompassed by the pre-trained parameters and that contained in the instruction-tuning data.
1 code implementation • 14 Nov 2023 • Rui Pan, Shuo Xing, Shizhe Diao, Wenhe Sun, Xiang Liu, Kashun Shum, Renjie Pi, Jipeng Zhang, Tong Zhang
Since the emergence of large language models, prompt learning has become a popular method for optimizing and customizing these models.
1 code implementation • 20 Oct 2023 • Ziqiang Zheng, Jipeng Zhang, Tuan-Anh Vu, Shizhe Diao, Yue Him Wong Tim, Sai-Kit Yeung
Large language models (LLMs), such as ChatGPT/GPT-4, have proven to be powerful tools in promoting the user experience as an AI assistant.
1 code implementation • 15 Oct 2023 • Xu Liu, Junfeng Hu, Yuan Li, Shizhe Diao, Yuxuan Liang, Bryan Hooi, Roger Zimmermann
To address these issues, we propose UniTime for effective cross-domain time series learning.
Ranked #5 on Time Series Forecasting on ETTh1 (336) Multivariate
1 code implementation • 12 Sep 2023 • Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Heng Ji, Yuan YAO, Tong Zhang
Building on this analysis, and on the observation that averaging different layers of the transformer yields significantly different alignment-forgetting trade-offs, we propose Heterogeneous Model Averaging (HMA), which heterogeneously searches for combination ratios of the model layers.
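A minimal sketch of per-layer model averaging under our reading of HMA; the `layer_ratios` mapping from layer-name prefixes to ratios is a hypothetical interface:

```python
def heterogeneous_average(sft_state, aligned_state, layer_ratios):
    """Merge two state dicts with a per-layer ratio instead of a single
    global averaging coefficient (our sketch of the idea)."""
    merged = {}
    for name, w_sft in sft_state.items():
        alpha = next((r for prefix, r in layer_ratios.items()
                      if name.startswith(prefix)), 0.5)
        merged[name] = alpha * aligned_state[name] + (1 - alpha) * w_sft
    return merged
```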
1 code implementation • 21 Jun 2023 • Shizhe Diao, Rui Pan, Hanze Dong, Ka Shun Shum, Jipeng Zhang, Wei Xiong, Tong Zhang
As the number of available foundation models and specialized tasks keeps growing, the job of training scientific language models becomes highly nontrivial.
1 code implementation • 8 Jun 2023 • Shizhe Diao, Tianyang Xu, Ruijia Xu, Jiawei Wang, Tong Zhang
Pre-trained language models (PLMs) demonstrate excellent abilities to understand text in the generic domain while struggling in specific domains.
1 code implementation • 6 Jun 2023 • Zhihong Chen, Guiming Hardy Chen, Shizhe Diao, Xiang Wan, Benyou Wang
Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing, with BERT as one of the representative models.
1 code implementation • 23 May 2023 • Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Jianhua Han, Hang Xu, Lingpeng Kong, Tong Zhang
Overall, our proposed paradigm and DetGPT demonstrate the potential for more sophisticated and intuitive interactions between humans and machines.
1 code implementation • 13 Apr 2023 • Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang
Utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discards those that exhibit undesired behavior, and subsequently enhances the model by fine-tuning on these filtered samples.
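One round of this sample-filter-finetune loop, sketched with hypothetical `policy.sample` / `policy.finetune` interfaces (our illustration of the described procedure, not the released code):

```python
def reward_ranked_finetune_step(policy, reward_model, prompts, k=8, keep=1):
    """Sample k responses per prompt, score them with the reward model,
    keep only the highest-reward responses, and fine-tune on the
    filtered set."""
    pool = []
    for x in prompts:
        samples = [policy.sample(x) for _ in range(k)]
        best = sorted(samples, key=lambda y: reward_model(x, y),
                      reverse=True)[:keep]
        pool.extend((x, y) for y in best)
    policy.finetune(pool)  # supervised fine-tuning on high-reward samples
```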
2 code implementations • 24 Feb 2023 • Kashun Shum, Shizhe Diao, Tong Zhang
However, most CoT studies rely on carefully designed, human-annotated rationale chains to prompt LLMs, posing challenges for real-world applications where labeled data is available without rationale chains.
2 code implementations • 23 Feb 2023 • Shizhe Diao, Pengcheng Wang, Yong Lin, Rui Pan, Xiang Liu, Tong Zhang
For this purpose, we propose a solution to the key problem of determining which questions are the most important and helpful ones to annotate from a pool of task-specific queries.
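An uncertainty-driven selection sketch in the spirit of the abstract: sample several answers per question and annotate the questions the model disagrees with itself on most (`answer_fn` is a hypothetical sampling interface):

```python
from collections import Counter

def select_uncertain_questions(questions, answer_fn, k=10, n_annotate=8):
    """Rank questions by disagreement (number of distinct answers among
    k stochastic samples) and return the most uncertain ones as the
    annotation budget's best targets."""
    def disagreement(q):
        answers = [answer_fn(q) for _ in range(k)]
        return len(Counter(answers))   # more distinct answers = more uncertain
    return sorted(questions, key=disagreement, reverse=True)[:n_annotate]
```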
1 code implementation • 20 Feb 2023 • Shizhe Diao, Sedrick Scott Keh, Liangming Pan, Zhiliang Tian, Yan Song, Tong Zhang
Social media classification tasks (e.g., tweet sentiment analysis, tweet stance detection) are challenging because social media posts are typically short, informal, and ambiguous.
1 code implementation • ICCV 2023 • Zhihong Chen, Shizhe Diao, Benyou Wang, Guanbin Li, Xiang Wan
Medical vision-and-language pre-training (Med-VLP) has shown promising improvements on many downstream medical tasks owing to its applicability to extracting generic representations from medical images and texts.
1 code implementation • 30 Nov 2022 • Rui Pan, Shizhe Diao, Jianlin Chen, Tong Zhang
In this paper, we present ExtremeBERT, a toolkit for accelerating and customizing BERT pretraining.
1 code implementation • 21 Nov 2022 • Hanze Dong, Shizhe Diao, Weizhong Zhang, Tong Zhang
The resulting method is significantly more powerful than the standard normalization flow approach for generating data distributions with multiple modes.
1 code implementation • 15 Jun 2022 • Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, Jiawei Wang
In this work, we disclose the potential of symmetric generative vision-language pre-training in learning to write and paint concurrently, and propose a new unified modal model, named DaVinci, trained with prefix language modeling and prefix image modeling, a simple generative self-supervised objective on image-text pairs.
1 code implementation • 30 May 2022 • Wangchunshu Zhou, Yan Zeng, Shizhe Diao, Xinsong Zhang
We release the VLUE benchmark to promote research on building vision-language models that generalize well to more diverse images and concepts unseen during pre-training, and are practical in terms of efficiency-performance trade-off.
1 code implementation • 21 Jan 2022 • Shizhe Diao, Zhichao Huang, Ruijia Xu, Xuechun Li, Yong Lin, Xiao Zhou, Tong Zhang
Particularly, instead of fine-tuning the model in the cloud, we adapt PLMs by prompt learning, which efficiently optimizes only a few parameters of the discrete prompts.
1 code implementation • NeurIPS 2021 • Xiao Zhou, Weizhong Zhang, Zonghao Chen, Shizhe Diao, Tong Zhang
For the latter step, instead of using the chain-rule-based gradient estimators of existing methods, we propose a variance-reduced policy gradient estimator, which requires only two forward passes without backward propagation, thus achieving completely sparse training.
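A generic sketch of such a score-function estimator for Bernoulli mask probabilities, using two forward passes and no backprop, with each sample's loss acting as the other's baseline (our illustration of the general technique, not necessarily the paper's exact estimator):

```python
import numpy as np

def mask_grad_estimate(loss_fn, p, rng):
    """Estimate d/dp E[loss(mask)] for mask ~ Bernoulli(p) from two
    forward passes only. Because E[score] = 0 and the samples are
    independent, pairing each loss with the other's score is unbiased
    and reduces variance versus the plain REINFORCE estimator."""
    m1 = (rng.random(p.shape) < p).astype(float)
    m2 = (rng.random(p.shape) < p).astype(float)
    s1 = (m1 - p) / (p * (1 - p))        # d/dp log P(m1 | p)
    s2 = (m2 - p) / (p * (1 - p))
    return 0.5 * (loss_fn(m1) - loss_fn(m2)) * (s1 - s2)

g = mask_grad_estimate(lambda m: float(m.sum()),
                       p=np.full(8, 0.5), rng=np.random.default_rng(0))
```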
1 code implementation • ACL 2021 • Shizhe Diao, Ruijia Xu, Hongjin Su, Yilei Jiang, Yan Song, Tong Zhang
In this paper, we aim to adapt a generic pretrained model with a relatively small amount of domain-specific data.
1 code implementation • 21 Apr 2020 • Shizhe Diao, Yan Song, Tong Zhang
Keyphrase generation aims to produce a set of phrases summarizing the essentials of a given document.
7 code implementations • Findings of the Association for Computational Linguistics 2020 • Shizhe Diao, Jiaxin Bai, Yan Song, Tong Zhang, Yonggang Wang
Moreover, it is shown that reasonable performance can be obtained when ZEN is trained on a small corpus, which is important for applying pre-training techniques to scenarios with limited data.
Ranked #1 on Chinese Part-of-Speech Tagging on CTB5 Dev
Tasks: Chinese Named Entity Recognition, Chinese Word Segmentation, +5 more