We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text.
However, creating high-quality datasets with LLMs can be challenging.
1 code implementation • 22 Mar 2023 • Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang
We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models.
Ranked #11 on Arithmetic Reasoning on GSM8K
We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
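As a rough, hypothetical illustration of how a spatial-relationship metric of this kind can be scored, the sketch below checks whether detected bounding-box centroids satisfy the relation stated in a prompt; the detection format and helper names are assumptions for illustration, not the VISOR implementation itself.

# Hypothetical sketch: score one "A <relation> B" prompt against object
# detections from a generated image. Not the official VISOR code.

def centroid(box):
    """Center (x, y) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def relation_holds(box_a, box_b, relation):
    """Compare centroids to decide whether the stated spatial relation holds."""
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by   # image y grows downward
    if relation == "below":
        return ay > by
    raise ValueError(f"unknown relation: {relation}")

def score_image(detections, obj_a, obj_b, relation):
    """1.0 if both objects are detected and the relation holds, else 0.0."""
    if obj_a not in detections or obj_b not in detections:
        return 0.0
    return float(relation_holds(detections[obj_a], detections[obj_b], relation))

# Example: "a dog to the left of a chair"
dets = {"dog": (10, 40, 60, 90), "chair": (120, 30, 200, 110)}
print(score_image(dets, "dog", "chair", "left of"))  # 1.0

Averaging such per-image scores over many prompts and generations gives an aggregate measure of how often the described spatial relationship is rendered correctly.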
no code implementations • 31 Oct 2022 • Peter Stone, Rodney Brooks, Erik Brynjolfsson, Ryan Calo, Oren Etzioni, Greg Hager, Julia Hirschberg, Shivaram Kalyanakrishnan, Ece Kamar, Sarit Kraus, Kevin Leyton-Brown, David Parkes, William Press, AnnaLee Saxenian, Julie Shah, Milind Tambe, Astro Teller
In September 2016, Stanford's "One Hundred Year Study on Artificial Intelligence" project (AI100) issued the first report of its planned long-term periodic assessment of artificial intelligence (AI) and its impact on society.
To help mitigate these issues, we create ToxiGen, a new large-scale and machine-generated dataset of 274k toxic and benign statements about 13 minority groups.
In AI-assisted decision-making, effective hybrid (human-AI) teamwork depends not only on AI performance but also on the AI's impact on human decision-making.
Trained AI systems and expert decision makers can make errors that are often difficult to identify and understand.
Disaggregated evaluations of AI systems, in which system performance is assessed and reported separately for different groups of people, are conceptually simple.
Our theoretical results establish a lower bound on the probability of recourse invalidation due to model shifts, and show the existence of a tradeoff between this invalidation probability and typical notions of "cost" minimized by modern recourse generation algorithms.
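As a schematic reading of that claim, with notation introduced here for illustration rather than taken from the paper: if model shifts produce an updated classifier \(\tilde{f}\) drawn from a distribution \(\mathcal{D}\) of plausible retrained models, the invalidation probability of a recourse \(\check{x}\) that was valid under the original model \(f\) can be written as

\[
\Delta(\check{x}) \;=\; \Pr_{\tilde{f} \sim \mathcal{D}}\!\left[\, \tilde{f}(\check{x}) \neq 1 \;\middle|\; f(\check{x}) = 1 \,\right],
\]

and the tradeoff described above says that recourses with smaller cost \(c(\check{x}, x)\) relative to the original input \(x\) tend to incur larger \(\Delta(\check{x})\).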
Traditional evaluation metrics for learned models that report aggregate scores over a test set are insufficient for surfacing important and informative patterns of failure over features and instances.
Learning to recognize and avoid such negative side effects of an agent's actions is critical to improve the safety and reliability of autonomous systems.
In many applications of machine learning (ML), updates are performed with the goal of enhancing model performance.
Machine learning (ML) models deployed in many safety- and business-critical systems are vulnerable to exploitation through adversarial examples.
However, prior studies observed improvements from explanations only when the AI, alone, outperformed both the human and the best team.
A rising vision for AI in the open world centers on the development of systems that can complement humans for perceptual, diagnostic, and reasoning tasks.
To optimize team performance in this setting, we maximize the team's expected utility, expressed in terms of the quality of the final decision, the cost of verification, and the individual accuracies of people and machines.
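A minimal sketch of this kind of objective, under assumed notation: the team acts on the machine prediction unless the instance is routed to a person for verification, and expected utility trades off decision quality against verification cost. The variable names and the additive form below are illustrative, not the paper's exact formulation.

# Illustrative sketch of an expected-team-utility calculation.
# p_machine: probability the machine's decision is correct
# p_human: probability the final decision is correct after human verification
# value_correct / value_wrong: utility of a correct / incorrect final decision
# cost_verify: cost of asking a person to verify
# These names and the additive form are assumptions for illustration.

def expected_utility(p_machine, p_human, value_correct, value_wrong,
                     cost_verify, verify):
    if verify:
        p_correct = p_human
        cost = cost_verify
    else:
        p_correct = p_machine
        cost = 0.0
    return p_correct * value_correct + (1 - p_correct) * value_wrong - cost

# Decide whether verification is worth its cost for one instance.
eu_no_verify = expected_utility(0.8, 0.95, 1.0, -1.0, 0.1, verify=False)
eu_verify = expected_utility(0.8, 0.95, 1.0, -1.0, 0.1, verify=True)
print("verify" if eu_verify > eu_no_verify else "accept machine decision")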
AI systems that model and interact with users can update their models over time to reflect new information and changes in the environment.
We quantify the extent to which this phenomenon occurs by creating a new Reasoning split of the VQA dataset and collecting VQA-introspect, a new dataset of 238K perception questions that serve as sub-questions for the perceptual tasks needed to effectively answer the complex reasoning questions in the Reasoning split.
Although systematic biases in decision-making are widely documented, the ways in which they emerge from different sources are less understood.
AI technologies have the potential to dramatically impact the lives of people with disabilities (PWD).
We introduce the notion of the compatibility of an AI update with prior user experience and present methods for studying the role of compatibility in human-AI teams.
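One way such a compatibility notion is often operationalized is as the fraction of examples the previous model handled correctly that the updated model still handles correctly; the sketch below computes that score, with names chosen for illustration rather than taken from the paper.

import numpy as np

def compatibility_score(y_true, old_pred, new_pred):
    """Fraction of instances the old model got right that the new model
    also gets right (1.0 when the update never breaks prior correct behavior).
    Illustrative formulation; not necessarily the paper's exact definition."""
    y_true, old_pred, new_pred = map(np.asarray, (y_true, old_pred, new_pred))
    old_correct = old_pred == y_true
    if not old_correct.any():
        return 1.0  # vacuously compatible: nothing to preserve
    both_correct = old_correct & (new_pred == y_true)
    return both_correct.sum() / old_correct.sum()

y   = [0, 1, 1, 0, 1]
old = [0, 1, 0, 0, 1]   # 4/5 correct
new = [0, 1, 1, 1, 0]   # 3/5 correct, breaks two of the old model's correct predictions
print(compatibility_score(y, old, new))  # 0.5

A score below 1.0 flags updates that, despite possibly higher aggregate accuracy, violate expectations a user formed while working with the previous model.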
We present Pandora, a set of hybrid human-machine methods and tools for describing and explaining system failures.
Dressel and Farid (2018) asked Mechanical Turk workers to evaluate a subset of defendants in the ProPublica COMPAS data for risk of recidivism, and concluded that COMPAS predictions were no more accurate or fair than predictions made by humans.
Agents trained in simulation may make errors in the real world due to mismatches between training and execution environments.
To the best of our knowledge, this is the first approach which can produce global explanations of the behavior of any given black box model through joint optimization of unambiguity, fidelity, and interpretability, while also allowing users to explore model behavior based on their preferences.
We study the problem of troubleshooting machine learning systems that rely on analytical pipelines of distinct components.
Predictive models deployed in the real world may assign incorrect labels to instances with high confidence.
Our work subsumes previously studied special cases of metareasoning and shows that in the general case, metareasoning is at most polynomially harder than solving MDPs with any given algorithm that disregards the cost of thinking.
Users may be willing to share private information in return for better quality of service, for incentives, or for assurances about the nature and extent of data logging.