Search Results for author: Besmira Nushi

Found 32 papers, 10 papers with code

Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead

1 code implementation31 Mar 2025 Vidhisha Balachandran, Jingya Chen, Lingjiao Chen, Shivam Garg, Neel Joshi, Yash Lara, John Langford, Besmira Nushi, Vibhav Vineet, Yue Wu, Safoora Yousefi

Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today's most advanced reasoning models.

Math · Spatial Reasoning
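
The repeated-sampling setup this abstract describes can be sketched as a minimal simulation: with a perfect verifier, N independent runs succeed whenever any single run does, so per-instance accuracy rises as 1 - (1 - p)^N. The `solve_once` stub and the 30% success rate below are illustrative assumptions, not numbers from the paper.

```python
import random

def best_of_n(solve_once, verifier, n):
    """Repeat independent attempts; a perfect verifier accepts the first correct one."""
    for _ in range(n):
        answer = solve_once()
        if verifier(answer):
            return answer
    return None

# Illustrative stub: a task the base model solves 30% of the time.
random.seed(0)
solve_once = lambda: random.random() < 0.3   # True = correct answer
verifier = lambda ans: ans                   # perfect verifier

trials = 10_000
acc1 = sum(best_of_n(solve_once, verifier, 1) is True for _ in range(trials)) / trials
acc8 = sum(best_of_n(solve_once, verifier, 8) is True for _ in range(trials)) / trials
# acc1 sits near 0.30, while acc8 approaches 1 - 0.7**8 ≈ 0.94
```

This is why, for some tasks, many samples from a conventional model can approach the average accuracy of a much stronger reasoning model: the verifier converts per-attempt probability into per-instance coverage.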

MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation

1 code implementation7 Jan 2025 Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, Baharan Mirzasoleiman

Vision-language models (VLMs) are highly effective but often underperform on specialized tasks; for example, LLaVA-1.5 struggles with chart and diagram understanding due to scarce task-specific training data.

Spatial Reasoning

Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models

no code implementations29 Oct 2024 Rishabh Adiga, Besmira Nushi, Varun Chandrasekaran

We believe that analyzing attention is also crucial for understanding bias, as it provides insight into how the LLM distributes its focus across different entities and how this contributes to biased decisions.

Data Augmentation
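
A minimal sketch of the kind of analysis the abstract describes, assuming access to an attention matrix: sum the attention mass that the final (decision) position assigns to each candidate entity's token span, and compare. The function name, span format, and toy matrix are illustrative, not the paper's actual procedure.

```python
import numpy as np

def entity_attention_mass(attn, entity_spans):
    """attn: (seq_len, seq_len) attention matrix for one head/layer, rows = queries.
    entity_spans: {entity_name: (start, end)} token index ranges.
    Returns the attention mass the last token assigns to each entity's span."""
    last_row = attn[-1]  # attention distribution from the final position
    return {name: float(last_row[s:e].sum()) for name, (s, e) in entity_spans.items()}

# Toy example: the final token attends mostly to tokens 0-2 ("entity A").
attn = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0, 0.0],
    [0.2, 0.2, 0.2, 0.4, 0.0],
    [0.4, 0.3, 0.1, 0.1, 0.1],
])
mass = entity_attention_mass(attn, {"entity_A": (0, 3), "entity_B": (3, 5)})
# mass["entity_A"] ≈ 0.8 vs mass["entity_B"] ≈ 0.2: the skew itself is the signal
```

A skew like this across entities is the sort of localized signal one could then try to mitigate, e.g. by intervening on the offending heads.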

Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models

no code implementations17 Oct 2024 Mazda Moayeri, Vidhisha Balachandran, Varun Chandrasekaran, Safoora Yousefi, Thomas Fel, Soheil Feizi, Besmira Nushi, Neel Joshi, Vibhav Vineet

With models getting stronger, evaluations have grown more complex, testing multiple skills in one benchmark, and even in the same instance, at once.

Improving Instruction-Following in Language Models through Activation Steering

no code implementations15 Oct 2024 Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, Besmira Nushi

Our experiments across four models demonstrate how we can use the activation vectors to guide models to follow constraints even without explicit instructions and to enhance performance when instructions are present.

Instruction Following · Text Generation
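
The core recipe behind activation steering can be sketched in a few lines: derive a steering vector as the difference between mean hidden activations under instruction-present and instruction-absent prompts, then add a scaled copy of it to a hidden state at generation time. The sketch below shows only the vector arithmetic on toy activations; hooking it into a real model's forward pass, and which layer to target, are model-specific and not shown.

```python
import numpy as np

def steering_vector(acts_with, acts_without):
    """Difference-of-means steering vector from two sets of hidden states,
    each of shape (n_prompts, hidden_dim)."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)

def apply_steering(hidden, vec, alpha=1.0):
    """Add the scaled steering vector to a hidden state during generation."""
    return hidden + alpha * vec

# Toy activations (hidden_dim=4): "instruction present" shifts dimension 0.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 4))
acts_without = base
acts_with = base + np.array([2.0, 0.0, 0.0, 0.0])

vec = steering_vector(acts_with, acts_without)       # recovers the shift direction
steered = apply_steering(np.zeros(4), vec, alpha=1.0)
```

Because the construction is a difference of means, the vector isolates whatever direction separates the two prompt conditions, which is what lets it impose a constraint even when no explicit instruction is in the prompt.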

Eureka: Evaluating and Understanding Large Foundation Models

1 code implementation13 Sep 2024 Vidhisha Balachandran, Jingya Chen, Neel Joshi, Besmira Nushi, Hamid Palangi, Eduardo Salinas, Vibhav Vineet, James Woffinden-Luey, Safoora Yousefi

Second, we introduce Eureka-Bench as an extensible collection of benchmarks testing capabilities that (i) are still challenging for state-of-the-art models and (ii) represent fundamental but overlooked language and multimodal capabilities.

Information Retrieval

Understanding Information Storage and Transfer in Multi-modal Large Language Models

no code implementations6 Jun 2024 Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, Daniela Massiceti

Understanding the mechanisms of information storage and transfer in Transformer-based models is important for driving model understanding progress.

Factual · Visual Question Answering · Model Editing +2

Introducing v0.5 of the AI Safety Benchmark from MLCommons

1 code implementation18 Apr 2024 Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Max Bartolo, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, Ram Gandikota, Agasthya Gangavarapu, Ananya Gangavarapu, James Gealy, Rajat Ghosh, James Goel, Usman Gohar, Sujata Goswami, Scott A. Hale, Wiebke Hutiri, Joseph Marvin Imperial, Surgan Jandial, Nick Judd, Felix Juefei-Xu, Foutse khomh, Bhavya Kailkhura, Hannah Rose Kirk, Kevin Klyman, Chris Knotz, Michael Kuchnik, Shachi H. Kumar, Srijan Kumar, Chris Lengerich, Bo Li, Zeyi Liao, Eileen Peters Long, Victor Lu, Sarah Luger, Yifan Mai, Priyanka Mary Mammen, Kelvin Manyeki, Sean McGregor, Virendra Mehta, Shafee Mohammed, Emanuel Moss, Lama Nachman, Dinesh Jinenhally Naganna, Amin Nikanjam, Besmira Nushi, Luis Oala, Iftach Orr, Alicia Parrish, Cigdem Patlak, William Pietri, Forough Poursabzi-Sangdeh, Eleonora Presani, Fabrizio Puletti, Paul Röttger, Saurav Sahay, Tim Santos, Nino Scherrer, Alice Schoenauer Sebag, Patrick Schramowski, Abolfazl Shahbazi, Vin Sharma, Xudong Shen, Vamsi Sistla, Leonard Tang, Davide Testuggine, Vithursan Thangarasa, Elizabeth Anne Watkins, Rebecca Weiss, Chris Welty, Tyler Wilbers, Adina Williams, Carole-Jean Wu, Poonam Yadav, Xianjun Yang, Yi Zeng, Wenhui Zhang, Fedor Zhdanov, Jiacheng Zhu, Percy Liang, Peter Mattson, Joaquin Vanschoren

We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark.

Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models

1 code implementation9 Apr 2024 Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, Rich Caruana

We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training.

Few-Shot Learning · Language Modelling +2

KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

1 code implementation24 Oct 2023 Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, Besmira Nushi

Motivated by rising concerns around factual incorrectness and hallucinations of LLMs, we present KITAB, a new dataset for measuring constraint satisfaction abilities of language models.

Information Retrieval · Retrieval

Diversity of Thought Improves Reasoning Abilities of LLMs

no code implementations11 Oct 2023 Ranjita Naik, Varun Chandrasekaran, Mert Yuksekgonul, Hamid Palangi, Besmira Nushi

Large language models (LLMs) are documented to struggle in settings that require complex reasoning.

Diversity

Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models

1 code implementation26 Sep 2023 Mert Yuksekgonul, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, Besmira Nushi

We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text.
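
One probe consistent with the title's constraint-satisfaction framing is to measure how much attention the model pays to the constraint tokens in the prompt while generating. The sketch below computes such a score per layer and head from a toy attention tensor; the exact aggregation, and how the paper relates it to factual errors, are simplified here for illustration.

```python
import numpy as np

def constraint_attention(attn_layers, constraint_span, query_pos=-1):
    """attn_layers: (n_layers, n_heads, seq, seq) attention tensor.
    Returns, per layer and head, the attention mass that the query position
    places on the constraint token span."""
    s, e = constraint_span
    return attn_layers[:, :, query_pos, s:e].sum(axis=-1)

# Toy tensor: random rows normalized to sum to 1, like softmax output.
rng = np.random.default_rng(1)
raw = rng.random((2, 2, 6, 6))
attn = raw / raw.sum(axis=-1, keepdims=True)

score = constraint_attention(attn, (1, 3))  # tokens 1-2 hold the constraint
# score has shape (n_layers, n_heads); low mass on the constraint is the kind of
# internal signal one might correlate with factually incorrect generations
```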

Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning

no code implementations8 Apr 2023 Yu Yang, Besmira Nushi, Hamid Palangi, Baharan Mirzasoleiman

Spurious correlations that degrade model generalization or lead the model to be right for the wrong reasons are one of the main robustness concerns for real-world deployments.

Attribute

Social Biases through the Text-to-Image Generation Lens

no code implementations30 Mar 2023 Ranjita Naik, Besmira Nushi

In this paper, we take a multi-dimensional approach to studying and quantifying common social biases as reflected in the generated images, by focusing on how occupations, personality traits, and everyday situations are depicted across representations of (perceived) gender, age, race, and geographical location.

Descriptive · Text to Image Generation +1

Benchmarking Spatial Relationships in Text-to-Image Generation

1 code implementation20 Dec 2022 Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric Horvitz, Ece Kamar, Chitta Baral, Yezhou Yang

We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.

Benchmarking · Text to Image Generation +1
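
The spatial check underlying a metric like VISOR can be sketched from detected bounding boxes: compare object centroids along the relevant image axis, then score the fraction of generated images where the stated relation holds. The box format and the helper names below are assumptions for illustration, not the official VISOR implementation.

```python
def centroid(box):
    """box = (x_min, y_min, x_max, y_max) in image coordinates."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def relation_holds(box_a, box_b, relation):
    """Check a 2D spatial relation between object A and object B
    (image coordinates: x grows rightward, y grows downward)."""
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    return {
        "left of":  ax < bx,
        "right of": ax > bx,
        "above":    ay < by,
        "below":    ay > by,
    }[relation]

def visor_style_score(detections, relation):
    """Fraction of generated images whose detected pair satisfies the relation."""
    hits = [relation_holds(a, b, relation) for a, b in detections]
    return sum(hits) / len(hits)

# "a cat to the left of a dog": the cat's box sits left of the dog's box
cat, dog = (10, 40, 50, 90), (120, 30, 180, 100)
ok = relation_holds(cat, dog, "left of")  # True for these boxes
```

Averaging this check over many generations for the same prompt yields an accuracy-style number for each T2I model and relation.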

Advancing Human-AI Complementarity: The Impact of User Expertise and Algorithmic Tuning on Joint Decision Making

no code implementations16 Aug 2022 Kori Inkpen, Shreya Chappidi, Keri Mallari, Besmira Nushi, Divya Ramesh, Pietro Michelucci, Vani Mandava, Libuše Hannah Vepřek, Gabrielle Quinn

In addition, we found that users' perception of the AI's performance relative to their own also had a significant impact on whether their accuracy improved when given AI recommendations.

Decision Making

Who Goes First? Influences of Human-AI Workflow on Decision Making in Clinical Imaging

no code implementations19 May 2022 Riccardo Fogliato, Shreya Chappidi, Matthew Lungren, Michael Fitzke, Mark Parkinson, Diane Wilson, Paul Fisher, Eric Horvitz, Kori Inkpen, Besmira Nushi

A critical aspect of interaction design for AI-assisted human decision making is the set of policies about the display and sequencing of AI inferences within larger decision-making workflows.

Decision Making · Diagnostic

Investigations of Performance and Bias in Human-AI Teamwork in Hiring

no code implementations21 Feb 2022 Andi Peng, Besmira Nushi, Emre Kiciman, Kori Inkpen, Ece Kamar

In AI-assisted decision-making, effective hybrid (human-AI) teamwork depends not on AI performance alone, but also on the AI's impact on human decision-making.

Decision Making

Hierarchical Analysis of Visual COVID-19 Features from Chest Radiographs

1 code implementation14 Jul 2021 Shruthi Bannur, Ozan Oktay, Melanie Bernhardt, Anton Schwaighofer, Rajesh Jena, Besmira Nushi, Sharan Wadhwani, Aditya Nori, Kal Natarajan, Shazad Ashraf, Javier Alvarez-Valle, Daniel C. Castro

Chest radiography has been a recommended procedure for patient triaging and resource management in intensive care units (ICUs) throughout the COVID-19 pandemic.

Management

Understanding Failures of Deep Networks via Robust Feature Extraction

1 code implementation CVPR 2021 Sahil Singla, Besmira Nushi, Shital Shah, Ece Kamar, Eric Horvitz

Traditional evaluation metrics for learned models that report aggregate scores over a test set are insufficient for surfacing important and informative patterns of failure over features and instances.

An Empirical Analysis of Backward Compatibility in Machine Learning Systems

no code implementations11 Aug 2020 Megha Srivastava, Besmira Nushi, Ece Kamar, Shital Shah, Eric Horvitz

In many applications of machine learning (ML), updates are performed with the goal of enhancing model performance.

BIG-bench Machine Learning

Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork

no code implementations27 Apr 2020 Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, Daniel S. Weld

To optimize team performance in this setting, we maximize the team's expected utility, expressed in terms of the quality of the final decision, the cost of verifying, and the individual accuracies of people and machines.

Decision Making
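
The objective in the excerpt, expected team utility combining decision quality, verification cost, and the accuracies of human and machine, can be written down directly. The sketch below is a deliberately simplified one-shot version under illustrative assumptions: the AI either answers alone or defers to a human who verifies at a fixed cost; it is not the paper's full formulation.

```python
def expected_team_utility(p_ai, p_human, defer, reward=1.0, verify_cost=0.2):
    """One-shot model: if `defer`, the human verifies (accuracy p_human,
    paying verify_cost); otherwise the AI's answer stands (accuracy p_ai)."""
    if defer:
        return reward * p_human - verify_cost
    return reward * p_ai

def best_policy(p_ai, p_human, reward=1.0, verify_cost=0.2):
    """Pick whichever action has higher expected utility."""
    u = {d: expected_team_utility(p_ai, p_human, d, reward, verify_cost)
         for d in (False, True)}
    return max(u, key=u.get), u

# Deferring pays off only when the human's accuracy edge exceeds the cost:
choice, utils = best_policy(p_ai=0.70, p_human=0.95)  # 0.95 - 0.2 = 0.75 > 0.70 → defer
choice2, _ = best_policy(p_ai=0.90, p_human=0.95)     # 0.75 < 0.90 → AI answers alone
```

The second case illustrates the paper's titular point: the most accurate AI is not automatically the best teammate once verification costs and human behavior enter the objective.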

SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions

no code implementations CVPR 2020 Ramprasaath R. Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Ribeiro, Besmira Nushi, Ece Kamar

We quantify the extent to which this phenomenon occurs by creating a new Reasoning split of the VQA dataset and collecting VQA-introspect, a new dataset consisting of 238K new perception questions that serve as sub-questions corresponding to the set of perceptual tasks needed to effectively answer the complex reasoning questions in the Reasoning split.

Visual Question Answering (VQA)

What You See Is What You Get? The Impact of Representation Criteria on Human Bias in Hiring

no code implementations8 Sep 2019 Andi Peng, Besmira Nushi, Emre Kiciman, Kori Inkpen, Siddharth Suri, Ece Kamar

Although systematic biases in decision-making are widely documented, the ways in which they emerge from different sources are less understood.

Decision Making

A Case for Backward Compatibility for Human-AI Teams

no code implementations4 Jun 2019 Gagan Bansal, Besmira Nushi, Ece Kamar, Dan Weld, Walter Lasecki, Eric Horvitz

We introduce the notion of the compatibility of an AI update with prior user experience and present methods for studying the role of compatibility in human-AI teams.

Decision Making
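
A natural way to quantify the compatibility notion the abstract introduces is the fraction of inputs the old model got right that the updated model still gets right, so that users' learned trust in the system is not violated by an update. The sketch below computes such a score from paired predictions; the function name and toy data are illustrative, and the paper's exact formulation may differ.

```python
def compatibility_score(y_true, old_pred, new_pred):
    """Backward compatibility: of the examples the old model answered
    correctly, what fraction does the updated model still answer correctly?"""
    old_correct = [i for i, (y, p) in enumerate(zip(y_true, old_pred)) if p == y]
    if not old_correct:
        return 1.0  # vacuously compatible: nothing to regress on
    still_ok = sum(new_pred[i] == y_true[i] for i in old_correct)
    return still_ok / len(old_correct)

y     = [0, 1, 1, 0, 1]
old_p = [0, 1, 0, 0, 1]  # correct on indices 0, 1, 3, 4
new_p = [0, 1, 1, 1, 1]  # more accurate overall, but newly wrong on index 3
score = compatibility_score(y, old_p, new_p)  # 3 of 4 previously-correct survive
```

Note the asymmetry: the new model here has higher raw accuracy yet imperfect compatibility, which is exactly the tension between accuracy and prior user experience that the paper studies.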

Metareasoning in Modular Software Systems: On-the-Fly Configuration using Reinforcement Learning with Rich Contextual Representations

no code implementations12 May 2019 Aditya Modi, Debadeepta Dey, Alekh Agarwal, Adith Swaminathan, Besmira Nushi, Sean Andrist, Eric Horvitz

We address the opportunity to maximize the utility of an overall computing system by employing reinforcement learning to guide the configuration of the set of interacting modules that comprise the system.

Decision Making · reinforcement-learning +2

Towards Accountable AI: Hybrid Human-Machine Analyses for Characterizing System Failure

no code implementations19 Sep 2018 Besmira Nushi, Ece Kamar, Eric Horvitz

We present Pandora, a set of hybrid human-machine methods and tools for describing and explaining system failures.

BIG-bench Machine Learning · Image Captioning

Crowd Access Path Optimization: Diversity Matters

no code implementations8 Aug 2015 Besmira Nushi, Adish Singla, Anja Gruenheid, Erfan Zamanian, Andreas Krause, Donald Kossmann

Based on this intuitive idea, we introduce the Access Path Model (APM), a novel crowd model that leverages the notion of access paths as an alternative way of retrieving information.

Diversity
