How Likely Do LLMs with CoT Mimic Human Reasoning?

25 Feb 2024  ·  Guangsheng Bao, Hongbo Zhang, Cunxiang Wang, Linyi Yang, Yue Zhang ·

Chain-of-thought (CoT) emerges as a promising technique to elicit reasoning capabilities from Large Language Models (LLMs). However, it does not always improve task performance or accurately represent reasoning processes, leaving unresolved questions around its usage. In this paper, we diagnose the underlying mechanism by comparing the reasoning process of LLMs with humans, using causal analysis to understand the relationships between the problem instruction, reasoning, and answer in both LLMs and humans. Our empirical study reveals that LLMs often deviate from a causal chain, resulting in spurious correlations and potential consistency errors (inconsistent reasoning and answer). We also examine various factors influencing the causal structure, finding that in-context learning with examples strengthens it while post-training techniques like supervised fine-tuning and reinforcement learning on human feedback weaken it. To our surprise, the causal structure cannot be strengthened by enlarging the model size, urging research on new techniques. We hope this preliminary study will shed light on the understanding and further improvement of the reasoning process in LLMs.

PDF Abstract

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here