no code implementations • 15 Jun 2024 • Sree Harsha Tanneru, Dan Ley, Chirag Agarwal, Himabindu Lakkaraju
In this work, we explore the promise of three broad approaches commonly employed to steer the behavior of LLMs, namely in-context learning, fine-tuning, and activation editing, for enhancing the faithfulness of the CoT reasoning they generate.
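To make the in-context learning approach concrete, here is a minimal sketch of steering via few-shot demonstrations whose chain-of-thought is consistent with the final answer. The `FAITHFUL_DEMOS` examples and the `query_llm` helper are hypothetical placeholders, not the paper's actual setup.

```python
# Sketch: steer CoT behavior via in-context demonstrations whose reasoning
# actually supports the stated answer. `query_llm` is a hypothetical stand-in
# for any chat-completion API; the demos below are illustrative only.

FAITHFUL_DEMOS = [
    {
        "question": "Is 17 a prime number?",
        "cot": "17 has no divisors other than 1 and itself, so it is prime.",
        "answer": "Yes",
    },
]

def build_cot_prompt(question: str) -> str:
    """Assemble a few-shot prompt that demonstrates faithful CoT reasoning."""
    parts = []
    for demo in FAITHFUL_DEMOS:
        parts.append(
            f"Q: {demo['question']}\nReasoning: {demo['cot']}\nA: {demo['answer']}"
        )
    parts.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(parts)

def query_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call."""
    raise NotImplementedError

if __name__ == "__main__":
    print(build_cot_prompt("Is 21 a prime number?"))
```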
no code implementations • 7 Feb 2024 • Chirag Agarwal, Sree Harsha Tanneru, Himabindu Lakkaraju
We highlight that the current trend towards increasing the plausibility of explanations, primarily driven by the demand for user-friendly interfaces, may come at the cost of diminishing their faithfulness.
1 code implementation • 6 Nov 2023 • Sree Harsha Tanneru, Chirag Agarwal, Himabindu Lakkaraju
In this work, we make one of the first attempts at quantifying the uncertainty in explanations of LLMs.
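One simple way to operationalize such uncertainty, sketched below under assumptions not taken from the paper, is to sample several token-importance explanations for the same input and treat disagreement among their top-k tokens as an uncertainty signal; the sampled importance maps here are toy data.

```python
# Sketch: estimate explanation uncertainty as 1 minus the mean pairwise
# top-k agreement across repeated explanation samples for the same prompt.
# The three importance maps below are illustrative, not real model output.

from itertools import combinations

def topk_jaccard(expl_a: dict, expl_b: dict, k: int = 3) -> float:
    """Jaccard overlap between the top-k tokens of two importance maps."""
    top_a = set(sorted(expl_a, key=expl_a.get, reverse=True)[:k])
    top_b = set(sorted(expl_b, key=expl_b.get, reverse=True)[:k])
    return len(top_a & top_b) / len(top_a | top_b)

def explanation_uncertainty(explanations: list, k: int = 3) -> float:
    """Return 1 - mean pairwise agreement; higher means more uncertain."""
    pairs = list(combinations(explanations, 2))
    mean_agreement = sum(topk_jaccard(a, b, k) for a, b in pairs) / len(pairs)
    return 1.0 - mean_agreement

samples = [
    {"movie": 0.9, "terrible": 0.8, "was": 0.1, "the": 0.05},
    {"terrible": 0.85, "movie": 0.7, "was": 0.2, "the": 0.05},
    {"was": 0.6, "the": 0.5, "movie": 0.4, "terrible": 0.3},
]
print(explanation_uncertainty(samples, k=2))
```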
no code implementations • 3 Jun 2023 • Alexander Lin, Lucas Monteiro Paes, Sree Harsha Tanneru, Suraj Srinivas, Himabindu Lakkaraju
We introduce a method that computes a score for each word in the prompt, representing that word's influence on biases in the model's output.
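A minimal leave-one-out sketch of the general idea is shown below: drop each word from the prompt, re-evaluate a bias metric on the result, and record the change. The `bias_score` heuristic is a toy stand-in for a real metric applied to the model's output, not the paper's scoring method.

```python
# Sketch: per-word influence via leave-one-out ablation against a bias metric.
# `bias_score` is a hypothetical, toy heuristic used only for illustration.

def bias_score(prompt: str) -> float:
    """Toy stand-in for a real bias metric computed on the model's output."""
    loaded_words = {"always", "never", "obviously"}
    words = prompt.lower().split()
    return sum(w in loaded_words for w in words) / max(len(words), 1)

def word_influence(prompt: str) -> dict:
    """Score each word by how much removing it changes the bias metric."""
    words = prompt.split()
    baseline = bias_score(prompt)
    scores = {}
    for i, word in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        scores[word] = baseline - bias_score(ablated)
    return scores

print(word_influence("Nurses are always women and obviously caring"))
```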