no code implementations • ICML 2020 • Himabindu Lakkaraju, Nino Arsov, Osbert Bastani
As machine learning black boxes are increasingly being deployed in real-world applications, there has been a growing interest in developing post hoc explanations that summarize the behaviors of these black box models.
1 code implementation • 21 May 2025 • Zidi Xiong, Yuping Lin, Wenya Xie, Pengfei He, Jiliang Tang, Himabindu Lakkaraju, Zhen Xiang
In this paper, we conduct an empirical study on how memory management choices impact the LLM agents' behavior, especially their long-term performance.
1 code implementation • 21 May 2025 • Aaron J. Li, Suraj Srinivas, Usha Bhalla, Himabindu Lakkaraju
Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations.
no code implementations • 19 May 2025 • Zidi Xiong, Chen Shan, Zhenting Qi, Himabindu Lakkaraju
Large Reasoning Models (LRMs) have significantly enhanced complex problem-solving by introducing a thinking draft that enables multi-path Chain-of-Thought exploration before producing a final answer.
no code implementations • 6 May 2025 • Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio P. Calmon
Best-of-N (BoN) sampling yields high reward values in practice, but at a distortion cost measured by the KL-divergence between the sampled and original distributions.
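For context, a minimal sketch of Best-of-N sampling together with the commonly used analytical estimate of its KL distortion, log N - (N-1)/N. The sampler and reward function here are hypothetical placeholders, not the paper's setup.

```python
import math

def best_of_n(prompt, sample_fn, reward_fn, n=16):
    """Draw n candidate responses and keep the one with the highest reward.

    sample_fn(prompt) -> str and reward_fn(prompt, response) -> float are
    hypothetical placeholders for a base LLM sampler and a reward model.
    """
    candidates = [sample_fn(prompt) for _ in range(n)]
    best = max(candidates, key=lambda r: reward_fn(prompt, r))
    # Commonly used analytical estimate of the KL divergence between the
    # Best-of-N distribution and the base sampling distribution.
    kl_estimate = math.log(n) - (n - 1) / n
    return best, kl_estimate
```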
no code implementations • 3 Apr 2025 • Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang
Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models.
1 code implementation • 2 Apr 2025 • Oam Patel, Jason Wang, Nikhil Shivakumar Nayak, Suraj Srinivas, Himabindu Lakkaraju
Instead, our framework inspires a new direction of trainable prompting methods that explicitly optimize for interpretability.
1 code implementation • 20 Mar 2025 • Vishisht Rao, Aounon Kumar, Himabindu Lakkaraju, Nihar B. Shah
We also empirically find that our approach is resilient to common reviewer defenses, and that the bounds on error rates in our statistical tests hold in practice.
no code implementations • 31 Dec 2024 • Martin Pawelczyk, Lillian Sun, Zhenting Qi, Aounon Kumar, Himabindu Lakkaraju
A key phenomenon known as weak-to-strong generalization - where a strong model trained on a weak model's outputs surpasses the weak model in task performance - has gained significant attention.
no code implementations • 22 Nov 2024 • Elita Lobo, Chirag Agarwal, Himabindu Lakkaraju
Large language models have emerged as powerful tools for general intelligence, showcasing advanced natural language processing capabilities that find applications across diverse domains.
1 code implementation • 7 Nov 2024 • Usha Bhalla, Suraj Srinivas, Asma Ghandeharioun, Himabindu Lakkaraju
We introduce two new evaluation metrics: intervention success rate and the coherence-intervention tradeoff, designed to measure the accuracy of explanations and their utility in controlling model behavior.
no code implementations • 13 Oct 2024 • Dan Ley, Suraj Srinivas, Shichang Zhang, Gili Rusak, Himabindu Lakkaraju
Data Attribution (DA) methods quantify the influence of individual training data points on model outputs and have broad applications such as explainability, data selection, and noisy label identification.
1 code implementation • 2 Oct 2024 • Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Himabindu Lakkaraju, James Glass
Scylla disentangles generalization from memorization by assessing model performance on both in-distribution (ID) and out-of-distribution (OOD) data through 20 tasks across 5 levels of complexity.
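As a rough illustration (not Scylla's exact protocol), the generalization-memorization contrast can be operationalized as the gap between ID and OOD accuracy; the `model.predict` interface is a placeholder.

```python
def generalization_gap(model, id_examples, ood_examples):
    """Proxy gap: a model that memorizes scores well in-distribution but
    degrades out-of-distribution, so a large ID-OOD accuracy gap suggests
    memorization rather than generalization."""
    def accuracy(examples):
        correct = sum(model.predict(x) == y for x, y in examples)
        return correct / len(examples)
    return accuracy(id_examples) - accuracy(ood_examples)
```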
1 code implementation • 20 Sep 2024 • Kaivalya Rawal, Himabindu Lakkaraju
We demonstrate that individual feature costs can be learned efficiently using MAP estimates, and show that non-exhaustive human surveys, which need not contain data for every feature-pair comparison, suffice to learn an exhaustive set of feature costs in which each feature is assigned a modification cost.
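A minimal sketch of how pairwise "which feature is harder to change?" survey responses could yield per-feature costs via a Bradley-Terry-style MAP estimate with a Gaussian prior; this is an illustrative model, not necessarily the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import minimize

def map_feature_costs(n_features, comparisons, prior_var=1.0):
    """comparisons: list of (i, j) pairs meaning respondents judged
    feature i costlier to modify than feature j. Costs are exp(theta)."""
    def neg_log_posterior(theta):
        # Bradley-Terry likelihood: P(i beats j) = sigmoid(theta_i - theta_j)
        nll = sum(np.log1p(np.exp(-(theta[i] - theta[j]))) for i, j in comparisons)
        return nll + theta @ theta / (2 * prior_var)  # Gaussian prior (MAP term)
    theta0 = np.zeros(n_features)
    theta_map = minimize(neg_log_posterior, theta0).x
    return np.exp(theta_map)  # positive modification cost per feature
```

Pairs that never appear in the survey are still assigned costs through the shared parameterization and the prior, which is what makes a non-exhaustive survey sufficient.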
1 code implementation • 18 Jul 2024 • Charumathi Badrinath, Usha Bhalla, Alex Oesterling, Suraj Srinivas, Himabindu Lakkaraju
Do different generative image models secretly learn similar underlying representations?
no code implementations • 11 Jul 2024 • Alex Oesterling, Usha Bhalla, Suresh Venkatasubramanian, Himabindu Lakkaraju
In this write-up, we address this shortcoming by providing an accessible overview of existing literature related to operationalizing regulatory principles.
no code implementations • 15 Jun 2024 • Sree Harsha Tanneru, Dan Ley, Chirag Agarwal, Himabindu Lakkaraju
In this work, we explore the promise of three broad approaches commonly employed to steer the behavior of LLMs to enhance the faithfulness of the CoT reasoning generated by LLMs: in-context learning, fine-tuning, and activation editing.
no code implementations • 8 May 2024 • Andreas Madsen, Himabindu Lakkaraju, Siva Reddy, Sarath Chandar
At present, interpretability research is divided into two paradigms: the intrinsic paradigm, which holds that only models designed to be interpretable can be explained, and the post-hoc paradigm, which holds that black-box models can be explained.
1 code implementation • 29 Apr 2024 • Aaron J. Li, Satyapriya Krishna, Himabindu Lakkaraju
The trustworthiness of Large Language Models (LLMs) refers to the extent to which their outputs are reliable, safe, and ethically aligned, and it has become a crucial consideration alongside their cognitive performance.
1 code implementation • 11 Apr 2024 • Aounon Kumar, Himabindu Lakkaraju
We demonstrate that adding a strategic text sequence (STS) -- a carefully crafted message -- to a product's information page can significantly increase its likelihood of being listed as the LLM's top recommendation.
no code implementations • 6 Apr 2024 • Elita Lobo, Harvineet Singh, Marek Petrik, Cynthia Rudin, Himabindu Lakkaraju
Off-policy Evaluation (OPE) methods are a crucial tool for evaluating policies in high-stakes domains such as healthcare, where exploration is often infeasible, unethical, or expensive.
1 code implementation • 6 Mar 2024 • Tessa Han, Aounon Kumar, Chirag Agarwal, Himabindu Lakkaraju
As large language models (LLMs) develop increasingly sophisticated capabilities and find applications in medical settings, it becomes important to assess their medical safety due to their far-reaching implications for personal and public health, patient safety, and human rights.
1 code implementation • 27 Feb 2024 • Zhenting Qi, Hanlin Zhang, Eric Xing, Sham Kakade, Himabindu Lakkaraju
Retrieval-Augmented Generation (RAG) improves pre-trained models by incorporating external knowledge at test time to enable customized adaptation.
no code implementations • 20 Feb 2024 • Jiaqi Ma, Vivian Lai, Yiming Zhang, Chacha Chen, Paul Hamilton, Davor Ljubenkov, Himabindu Lakkaraju, Chenhao Tan
However, properly evaluating the effectiveness of XAI methods inevitably requires human subjects, and conducting human-centered benchmarks is challenging in several ways: designing and implementing user studies is complex; the many choices in the user-study design space create reproducibility problems; and running user studies can be difficult and even daunting for machine learning researchers.
1 code implementation • 16 Feb 2024 • Usha Bhalla, Alex Oesterling, Suraj Srinivas, Flavio P. Calmon, Himabindu Lakkaraju
In this work, we show that the semantic structure of CLIP's latent space can be leveraged to provide interpretability, allowing for the decomposition of representations into semantic concepts.
no code implementations • 16 Feb 2024 • Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Mengnan Du
Large language models (LLMs) have led to breakthroughs in language tasks, yet the internal mechanisms that enable their remarkable generalization and reasoning abilities remain opaque.
no code implementations • 9 Feb 2024 • Satyapriya Krishna, Chirag Agarwal, Himabindu Lakkaraju
The development of Large Language Models (LLMs) has notably transformed numerous sectors, offering impressive text generation capabilities.
no code implementations • 7 Feb 2024 • Chirag Agarwal, Sree Harsha Tanneru, Himabindu Lakkaraju
We highlight that the current trend towards increasing the plausibility of explanations, primarily driven by the demand for user-friendly interfaces, may come at the cost of diminishing their faithfulness.
1 code implementation • 7 Dec 2023 • Hanlin Zhang, Yi-Fan Zhang, Yaodong Yu, Dhruv Madeka, Dean Foster, Eric Xing, Himabindu Lakkaraju, Sham Kakade
Accurate uncertainty quantification is crucial for the safe deployment of machine learning models, and prior research has demonstrated improvements in the calibration of modern language models (LMs).
1 code implementation • 6 Nov 2023 • Sree Harsha Tanneru, Chirag Agarwal, Himabindu Lakkaraju
In this work, we make one of the first attempts at quantifying the uncertainty in explanations of LLMs.
no code implementations • 23 Oct 2023 • Yanchen Liu, Srishti Gautam, Jiaqi Ma, Himabindu Lakkaraju
Recent literature has suggested the potential of using large language models (LLMs) to make classifications for tabular tasks.
2 code implementations • 11 Oct 2023 • Martin Pawelczyk, Seth Neel, Himabindu Lakkaraju
Machine unlearning, the study of efficiently removing the impact of specific training instances on a model, has garnered increased attention in recent years due to regulatory guidelines such as the Right to be Forgotten.
1 code implementation • 9 Oct 2023 • Nicholas Kroeger, Dan Ley, Satyapriya Krishna, Chirag Agarwal, Himabindu Lakkaraju
Despite their effectiveness in enhancing the performance of LLMs on diverse language and tabular tasks, these methods have not been thoroughly explored for their potential to generate post hoc explanations.
no code implementations • 28 Sep 2023 • Satyapriya Krishna, Chirag Agarwal, Himabindu Lakkaraju
As machine learning models are increasingly being employed in various high-stakes settings, it becomes important to ensure that predictions of these models are not only adversarially robust, but also readily explainable to relevant stakeholders.
1 code implementation • 6 Sep 2023 • Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, Himabindu Lakkaraju
We defend against three attack modes: i) adversarial suffix, where an adversarial sequence is appended at the end of a harmful prompt; ii) adversarial insertion, where the adversarial sequence is inserted anywhere in the middle of the prompt; and iii) adversarial infusion, where adversarial tokens are inserted at arbitrary positions in the prompt, not necessarily as a contiguous block.
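To make the three attack modes concrete, here is a hedged sketch of how each perturbed prompt is formed from a harmful prompt and an adversarial token sequence; token-level details are abstracted to lists of strings, and these helpers are illustrative rather than the paper's implementation.

```python
import random

def adversarial_suffix(prompt_tokens, adv_tokens):
    # Mode (i): adversarial sequence appended at the end of the prompt.
    return prompt_tokens + adv_tokens

def adversarial_insertion(prompt_tokens, adv_tokens):
    # Mode (ii): adversarial sequence inserted as a contiguous block
    # anywhere in the middle of the prompt.
    pos = random.randint(0, len(prompt_tokens))
    return prompt_tokens[:pos] + adv_tokens + prompt_tokens[pos:]

def adversarial_infusion(prompt_tokens, adv_tokens):
    # Mode (iii): adversarial tokens scattered at arbitrary positions,
    # not necessarily as a contiguous block.
    result = list(prompt_tokens)
    for tok in adv_tokens:
        result.insert(random.randint(0, len(result)), tok)
    return result
```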
no code implementations • 8 Aug 2023 • Catherine Huang, Chelse Swoopes, Christina Xiao, Jiaqi Ma, Himabindu Lakkaraju
We present two novel methods to generate differentially private recourse: Differentially Private Model (DPM) and Laplace Recourse (LR).
1 code implementation • NeurIPS 2023 • Usha Bhalla, Suraj Srinivas, Himabindu Lakkaraju
This strategy naturally combines the ease of use of post hoc explanations with the faithfulness of inherently interpretable models.
1 code implementation • 26 Jul 2023 • Tessa Han, Suraj Srinivas, Himabindu Lakkaraju
Studying the robustness of machine learning models is important to ensure consistent model behaviour across real-world settings.
no code implementations • 25 Jul 2023 • Skyler Wu, Eric Meng Shen, Charumathi Badrinath, Jiaqi Ma, Himabindu Lakkaraju
Chain-of-thought (CoT) prompting has been shown to empirically improve the accuracy of large language models (LLMs) on various question answering tasks.
no code implementations • 11 Jun 2023 • Anna P. Meyer, Dan Ley, Suraj Srinivas, Himabindu Lakkaraju
To this end, we conduct rigorous theoretical analysis to demonstrate that model curvature, the weight decay used during training, and the magnitude of the dataset shift are key factors that determine the extent of explanation (in)stability.
no code implementations • 9 Jun 2023 • Dan Ley, Leonard Tang, Matthew Nazari, Hongjin Lin, Suraj Srinivas, Himabindu Lakkaraju
This work addresses the challenge of providing consistent explanations for predictive models in the presence of model indeterminacy, which arises due to the existence of multiple (nearly) equally well-performing models for a given dataset and task.
no code implementations • 3 Jun 2023 • Alexander Lin, Lucas Monteiro Paes, Sree Harsha Tanneru, Suraj Srinivas, Himabindu Lakkaraju
We introduce a method for computing scores for each word in the prompt; these scores represent its influence on biases in the model's output.
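One simple way to realize such word-level scores, shown purely as an illustrative leave-one-out estimator and not necessarily the authors' method: score each word by how much removing it changes a bias measure of the model's output. Both `generate_fn` and `bias_fn` are hypothetical placeholders.

```python
def word_bias_scores(prompt, generate_fn, bias_fn):
    """generate_fn(prompt) -> text and bias_fn(text) -> float stand in for
    the LLM and a downstream bias metric."""
    words = prompt.split()
    baseline = bias_fn(generate_fn(prompt))
    scores = []
    for i in range(len(words)):
        ablated = " ".join(words[:i] + words[i + 1:])
        # Influence of word i = change in the bias metric when it is removed.
        scores.append(baseline - bias_fn(generate_fn(ablated)))
    return list(zip(words, scores))
```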
no code implementations • NeurIPS 2023 • Satyapriya Krishna, Jiaqi Ma, Dylan Slack, Asma Ghandeharioun, Sameer Singh, Himabindu Lakkaraju
Large Language Models (LLMs) have demonstrated remarkable capabilities in performing complex tasks.
no code implementations • 8 Feb 2023 • Satyapriya Krishna, Jiaqi Ma, Himabindu Lakkaraju
The Right to Explanation and the Right to be Forgotten are two important principles outlined to regulate algorithmic decision making and data usage in real-world applications.
1 code implementation • 10 Nov 2022 • Martin Pawelczyk, Himabindu Lakkaraju, Seth Neel
As predictive models are increasingly being employed to make consequential decisions, there is a growing emphasis on developing techniques that can provide algorithmic recourse to affected individuals.
no code implementations • 18 Sep 2022 • Harvineet Singh, Shalmali Joshi, Finale Doshi-Velez, Himabindu Lakkaraju
When deployment environments are expected to undergo changes (that is, dataset shifts), it is important for OPE methods to perform robust evaluation of the policies amidst such changes.
1 code implementation • 19 Aug 2022 • Chirag Agarwal, Owen Queen, Himabindu Lakkaraju, Marinka Zitnik
As post hoc explanations are increasingly used to understand the behavior of graph neural networks (GNNs), it becomes crucial to evaluate the quality and reliability of GNN explanations.
1 code implementation • 8 Jul 2022 • Dylan Slack, Satyapriya Krishna, Himabindu Lakkaraju, Sameer Singh
In real-world evaluations with humans, 73% of healthcare workers (e.g., doctors and nurses) agreed they would use TalkToModel over baseline point-and-click systems for explainability in a disease prediction task, and 85% of ML professionals agreed TalkToModel was easier to use for computing explanations.
2 code implementations • 22 Jun 2022 • Chirag Agarwal, Dan Ley, Satyapriya Krishna, Eshika Saxena, Martin Pawelczyk, Nari Johnson, Isha Puri, Marinka Zitnik, Himabindu Lakkaraju
OpenXAI comprises the following key components: (i) a flexible synthetic data generator and a collection of diverse real-world datasets, pre-trained models, and state-of-the-art feature attribution methods, and (ii) open-source implementations of eleven quantitative metrics for evaluating the faithfulness, stability (robustness), and fairness of explanation methods, in turn providing comparisons of several explanation methods across a wide variety of metrics, models, and datasets.
2 code implementations • 14 Jun 2022 • Suraj Srinivas, Kyle Matoba, Himabindu Lakkaraju, Francois Fleuret
To achieve this, we minimize a data-independent upper bound on the curvature of a neural network, which decomposes overall curvature in terms of curvatures and slopes of its constituent layers.
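For intuition, here is a standard chain-rule curvature bound of the kind such a decomposition builds on (a sketch for a scalar-valued two-layer composition, up to the choice of norms, not the paper's exact bound): if $h = g \circ f$, where $f$ has slope (Lipschitz) bound $L_f$ and curvature bound $C_f$, and likewise $g$ has $L_g$ and $C_g$, then

$$
\nabla^2 h(x) = J_f(x)^\top \nabla^2 g\big(f(x)\big)\, J_f(x) + \sum_k \partial_k g\big(f(x)\big)\, \nabla^2 f_k(x),
\qquad
\big\|\nabla^2 h(x)\big\| \;\lesssim\; L_f^2\, C_g + L_g\, C_f .
$$

Applying such a bound layer by layer expresses the network's overall curvature in terms of the curvatures and slopes of its constituent layers, which is the quantity being minimized.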
no code implementations • 6 Jun 2022 • Murtuza N Shergadwala, Himabindu Lakkaraju, Krishnaram Kenthapadi
Predictive models are increasingly used to make various consequential decisions in high-stakes domains such as healthcare, finance, and policy.
1 code implementation • 2 Jun 2022 • Tessa Han, Suraj Srinivas, Himabindu Lakkaraju
By bringing diverse explanation methods into a common framework, this work (1) advances the conceptual understanding of these methods, revealing their shared local function approximation objective, properties, and relation to one another, and (2) guides the use of these methods in practice, providing a principled approach to choosing among methods and paving the way for the creation of new ones.
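As a concrete instance of local function approximation (a LIME-style sketch under simplifying assumptions, not the paper's general framework): fit a weighted linear surrogate to the black box's outputs on Gaussian perturbations around the input and read the coefficients as attributions.

```python
import numpy as np

def local_linear_explanation(f, x, n_samples=500, sigma=0.1):
    """Fit a weighted least-squares linear surrogate to the black box f
    around the point x; the coefficients serve as feature attributions."""
    rng = np.random.default_rng(0)
    perturbations = x + sigma * rng.standard_normal((n_samples, x.shape[0]))
    y = np.array([f(z) for z in perturbations])
    # Weight samples by proximity to x (Gaussian kernel).
    weights = np.exp(-np.sum((perturbations - x) ** 2, axis=1) / (2 * sigma ** 2))
    X = np.column_stack([np.ones(n_samples), perturbations])
    sw = np.sqrt(weights)
    coef = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return coef[1:]  # per-feature attributions (intercept dropped)
```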
no code implementations • 15 May 2022 • Jessica Dai, Sohini Upadhyay, Ulrich Aivodji, Stephen H. Bach, Himabindu Lakkaraju
We then leverage these properties to propose a novel evaluation framework which can quantitatively measure disparities in the quality of explanations output by state-of-the-art methods.
no code implementations • 14 Mar 2022 • Chirag Agarwal, Nari Johnson, Martin Pawelczyk, Satyapriya Krishna, Eshika Saxena, Marinka Zitnik, Himabindu Lakkaraju
As attribution-based explanation methods are increasingly used to establish model trustworthiness in high-stakes situations, it is critical to ensure that these explanations are stable, e.g., robust to infinitesimal perturbations to an input.
1 code implementation • 13 Mar 2022 • Martin Pawelczyk, Teresa Datta, Johannes van-den-Heuvel, Gjergji Kasneci, Himabindu Lakkaraju
To this end, we propose a novel objective function that simultaneously minimizes the gap between the achieved and desired recourse invalidation rates, minimizes recourse costs, and ensures that the resulting recourse yields a positive model prediction.
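Schematically, such an objective can be written as follows (an illustrative form, not the paper's exact notation):

$$
\min_{x'} \; \Big( \Delta(x') - r_{\text{target}} \Big)^2 \;+\; \lambda_1 \, c(x, x') \;+\; \lambda_2 \, \ell\big(f(x'), 1\big),
$$

where $\Delta(x')$ is the estimated recourse invalidation rate for the recourse $x'$, $r_{\text{target}}$ the desired invalidation rate, $c(x, x')$ the cost of moving from $x$ to $x'$, $\ell$ a classification loss encouraging a positive prediction $f(x') = 1$, and $\lambda_1, \lambda_2$ trade off the three terms.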
1 code implementation • 3 Feb 2022 • Himabindu Lakkaraju, Dylan Slack, Yuxin Chen, Chenhao Tan, Sameer Singh
Overall, we hope our work serves as a starting place for researchers and engineers to design interactive explainability systems.
1 code implementation • 3 Feb 2022 • Satyapriya Krishna, Tessa Han, Alex Gu, Steven Wu, Shahin Jabbari, Himabindu Lakkaraju
In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements.
no code implementations • 24 Jun 2021 • Jessica Dai, Sohini Upadhyay, Stephen H. Bach, Himabindu Lakkaraju
In situations where explanations of black-box models may be useful, the fairness of the black-box is also often a relevant concern.
no code implementations • 23 Jun 2021 • Dylan Slack, Sophie Hilgard, Sameer Singh, Himabindu Lakkaraju
As machine learning models are increasingly used in critical decision-making settings (e.g., healthcare, finance), there has been a growing emphasis on developing methods to explain model predictions.
no code implementations • 18 Jun 2021 • Martin Pawelczyk, Chirag Agarwal, Shalmali Joshi, Sohini Upadhyay, Himabindu Lakkaraju
As machine learning (ML) models become more widely deployed in high-stakes applications, counterfactual explanations have emerged as key tools for providing actionable model explanations in practice.
no code implementations • 16 Jun 2021 • Chirag Agarwal, Marinka Zitnik, Himabindu Lakkaraju
As Graph Neural Networks (GNNs) are increasingly being employed in critical real-world applications, several methods have been proposed in recent literature to explain the predictions of these models.
no code implementations • NeurIPS 2021 • Dylan Slack, Sophie Hilgard, Himabindu Lakkaraju, Sameer Singh
In this work, we introduce the first framework that describes the vulnerabilities of counterfactual explanations and shows how they can be manipulated.
no code implementations • 29 Mar 2021 • Harvineet Singh, Shalmali Joshi, Finale Doshi-Velez, Himabindu Lakkaraju
Most of the existing work focuses on optimizing for either adversarial shifts or interventional shifts.
no code implementations • NeurIPS 2021 • Sohini Upadhyay, Shalmali Joshi, Himabindu Lakkaraju
To address this problem, we propose a novel framework, RObust Algorithmic Recourse (ROAR), that leverages adversarial training for finding recourses that are robust to model shifts.
3 code implementations • 25 Feb 2021 • Chirag Agarwal, Himabindu Lakkaraju, Marinka Zitnik
In this work, we establish a key connection between counterfactual fairness and stability and leverage it to propose a novel framework, NIFTY (uNIfying Fairness and stabiliTY), which can be used with any GNN to learn fair and stable representations.
no code implementations • 21 Feb 2021 • Sushant Agarwal, Shahin Jabbari, Chirag Agarwal, Sohini Upadhyay, Zhiwei Steven Wu, Himabindu Lakkaraju
As machine learning black boxes are increasingly being deployed in critical domains such as healthcare and criminal justice, there has been a growing emphasis on developing techniques for explaining these black boxes in a post hoc manner.
no code implementations • 22 Dec 2020 • Kaivalya Rawal, Ece Kamar, Himabindu Lakkaraju
Our theoretical results establish a lower bound on the probability of recourse invalidation due to model shifts, and show the existence of a tradeoff between this invalidation probability and typical notions of "cost" minimized by modern recourse generation algorithms.
no code implementations • 1 Dec 2020 • Tom Sühr, Sophie Hilgard, Himabindu Lakkaraju
In this work, we analyze various sources of gender bias in online hiring platforms, including the job context and the inherent biases of employers, and establish how these factors interact with ranking algorithms to affect hiring decisions.
no code implementations • 12 Nov 2020 • Himabindu Lakkaraju, Nino Arsov, Osbert Bastani
To the best of our knowledge, this work makes the first attempt at generating post hoc explanations that are robust to a general class of adversarial perturbations that are of practical interest.
1 code implementation • NeurIPS 2021 • Alexis Ross, Himabindu Lakkaraju, Osbert Bastani
As machine learning models are increasingly deployed in high-stakes domains such as legal and financial decision-making, there has been growing interest in post-hoc methods for generating counterfactual explanations.
no code implementations • 12 Nov 2020 • Sean McGrath, Parth Mehta, Alexandra Zytek, Isaac Lage, Himabindu Lakkaraju
As machine learning (ML) models are increasingly being employed to assist human decision makers, it becomes critical to provide these decision makers with relevant inputs which can help them decide if and how to incorporate model predictions into their decision making.
1 code implementation • NeurIPS 2020 • Wanqian Yang, Lars Lorch, Moritz A. Graule, Himabindu Lakkaraju, Finale Doshi-Velez
Domains where supervised models are deployed often come with task-specific constraints, such as prior expert knowledge on the ground-truth function, or desiderata like safety and fairness.
1 code implementation • NeurIPS 2020 • Kaivalya Rawal, Himabindu Lakkaraju
As predictive models are increasingly being deployed in high-stakes decision-making, there has been a lot of interest in developing algorithms which can provide recourses to affected individuals.
1 code implementation • NeurIPS 2021 • Dylan Slack, Sophie Hilgard, Sameer Singh, Himabindu Lakkaraju
In this paper, we address the aforementioned challenges by developing a novel Bayesian framework for generating local explanations along with their associated uncertainty.
no code implementations • 14 Jun 2020 • Aida Rahmattalabi, Shahin Jabbari, Himabindu Lakkaraju, Phebe Vayanos, Max Izenberg, Ryan Brown, Eric Rice, Milind Tambe
Under this framework, the trade-off between fairness and efficiency can be controlled by a single inequality aversion design parameter.
no code implementations • 15 Nov 2019 • Himabindu Lakkaraju, Osbert Bastani
Our work is the first to empirically establish how user trust in black box models can be manipulated via misleading explanations.
2 code implementations • 6 Nov 2019 • Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, Himabindu Lakkaraju
Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution still remain biased, but the post hoc explanations of the scaffolded classifier look innocuous.
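A minimal sketch of the scaffolding idea described here: route off-manifold inputs (the kind of perturbations generated by explainers such as LIME and SHAP) to an innocuous surrogate, while real inputs still hit the biased model. The detector and both models are placeholders, and this is a simplified illustration rather than the released implementation.

```python
class ScaffoldedClassifier:
    """Wraps a biased classifier so that perturbation-based explainers,
    which query the model on off-manifold perturbations, see only an
    innocuous surrogate, while in-distribution inputs remain biased."""

    def __init__(self, biased_model, innocuous_model, is_off_manifold):
        self.biased_model = biased_model          # e.g., relies on a sensitive feature
        self.innocuous_model = innocuous_model    # e.g., relies on uncorrelated features
        self.is_off_manifold = is_off_manifold    # detector trained to flag perturbations

    def predict(self, x):
        if self.is_off_manifold(x):
            return self.innocuous_model.predict(x)
        return self.biased_model.predict(x)
```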
no code implementations • 4 Jul 2017 • Himabindu Lakkaraju, Ece Kamar, Rich Caruana, Jure Leskovec
To the best of our knowledge, this is the first approach which can produce global explanations of the behavior of any given black box model through joint optimization of unambiguity, fidelity, and interpretability, while also allowing users to explore model behavior based on their preferences.
no code implementations • NeurIPS 2016 • Himabindu Lakkaraju, Jure Leskovec
We propose Confusions over Time (CoT), a novel generative framework which facilitates a multi-granular analysis of the decision making process.
no code implementations • 23 Nov 2016 • Himabindu Lakkaraju, Cynthia Rudin
We formulate this as a problem of learning a decision list -- a sequence of if-then-else rules -- which maps characteristics of subjects (e.g., diagnostic test results of patients) to treatments.
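A decision list of the kind described here is just an ordered sequence of if-then-else rules; a toy sketch with hypothetical conditions and treatment labels:

```python
def decision_list(patient):
    # Ordered if-then-else rules: the first matching rule determines the treatment.
    if patient["hba1c"] > 7.5 and patient["bmi"] > 30:
        return "treatment_A"
    elif patient["hba1c"] > 7.5:
        return "treatment_B"
    elif patient["age"] > 60:
        return "treatment_C"
    else:
        return "default_care"
```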
no code implementations • 28 Oct 2016 • Himabindu Lakkaraju, Ece Kamar, Rich Caruana, Eric Horvitz
Predictive models deployed in the real world may assign incorrect labels to instances with high confidence.
no code implementations • 21 Oct 2016 • Himabindu Lakkaraju, Cynthia Rudin
We formulate this as a problem of learning a decision list -- a sequence of if-then-else rules -- which maps characteristics of subjects (e.g., diagnostic test results of patients) to treatments.