Search Results for author: Niloofar Mireshghallah

Found 19 papers, 12 papers with code

A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage

no code implementations • 28 Apr 2025 • Rui Xin, Niloofar Mireshghallah, Shuyue Stella Li, Michael Duan, Hyunwoo Kim, Yejin Choi, Yulia Tsvetkov, Sewoong Oh, Pang Wei Koh

Sanitizing sensitive text data typically involves removing personally identifiable information (PII) or generating synthetic data, under the assumption that these methods adequately protect privacy. However, their effectiveness is usually assessed only by measuring the leakage of explicit identifiers, ignoring the nuanced textual markers that can still lead to re-identification.

MedQA
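
To make the gap described above concrete, here is a minimal, hypothetical sketch of surface-level sanitization (not the paper's pipeline; the pattern names and example record are illustrative): explicit identifiers are masked, yet quasi-identifying details remain.

```python
import re

# Hypothetical surface-level sanitizer: masks explicit identifiers only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_surface_pii(text: str) -> str:
    """Replace explicit identifiers with typed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

record = ("Reach me at jane.doe@example.com. I am the only pediatric "
          "oncologist who moved from Tromsø to a 300-person town in Vermont.")
print(scrub_surface_pii(record))
# The email is gone, but the rare combination of profession and location
# can still re-identify the writer -- the nuanced markers at issue above.
```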

ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data

no code implementations • 20 Apr 2025 • Tong Chen, Faeze Brahman, Jiacheng Liu, Niloofar Mireshghallah, Weijia Shi, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi

When applied to the instruction-tuned Tulu3-8B model, ParaPO with system prompting successfully preserves famous quotation recall while reducing unintentional regurgitation (from 8.7 to 6.3 in creative writing) when prompted not to regurgitate.

Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training

1 code implementation • 21 Feb 2025 • Jaydeep Borkar, Matthew Jagielski, Katherine Lee, Niloofar Mireshghallah, David A. Smith, Christopher A. Choquette-Choo

Due to the sensitive nature of personally identifiable information (PII), its owners may have the authority to control its inclusion or request its removal from large language model (LLM) training.

Language Modeling • Language Modelling +2

Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection

no code implementations • 20 Jan 2025 • Ali Naseh, Niloofar Mireshghallah

Recent work shows membership inference attacks (MIAs) on large language models (LLMs) produce inconclusive results, partly due to difficulties in creating non-member datasets without temporal shifts.

Memorization • Text Detection

Differentially Private Learning Needs Better Model Initialization and Self-Distillation

1 code implementation • 23 Oct 2024 • Ivoline C. Ngong, Joseph P. Near, Niloofar Mireshghallah

Differentially private SGD (DPSGD) enables privacy-preserving training of language models, but often reduces utility, diversity, and linguistic quality.

Diversity • Privacy Preserving
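
For context on the DPSGD mechanism referenced in the abstract, below is a minimal NumPy sketch of a single step on a toy linear model, assuming squared-error loss; the clipping norm and noise multiplier are illustrative, and real training would use a vetted library such as Opacus rather than this code.

```python
import numpy as np

def dpsgd_step(w, X, y, rng, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DPSGD step: per-example gradient clipping plus Gaussian noise."""
    residuals = X @ w - y                        # shape (n,)
    per_example_grads = residuals[:, None] * X   # gradient of 0.5*(Xw - y)^2
    # Clip each example's gradient to L2 norm <= clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    per_example_grads *= np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Sum, add noise calibrated to the clipping norm, then average.
    noisy_sum = per_example_grads.sum(axis=0) + rng.normal(
        0.0, noise_multiplier * clip_norm, size=w.shape)
    return w - lr * noisy_sum / len(X)

rng = np.random.default_rng(42)
X, true_w = rng.normal(size=(64, 3)), np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=64)
w = np.zeros(3)
for _ in range(200):
    w = dpsgd_step(w, X, y, rng)
print(w)  # noisy estimate of true_w: clipping and noise are the utility cost
```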

Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild

1 code implementation • 16 Jul 2024 • Niloofar Mireshghallah, Maria Antoniak, Yash More, Yejin Choi, Golnoosh Farnadi

Measuring personal disclosures made in human-chatbot interactions can provide a better understanding of users' AI literacy and facilitate privacy research for large language models (LLMs).

Chatbot

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

3 code implementations • 26 Jun 2024 • Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri

As WildJailbreak considerably upgrades the quality and scale of existing safety resources, it uniquely enables us to examine the scaling effects of data and the interplay of data properties and model capabilities during safety training.

Chatbot • Red Teaming

Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs

1 code implementation • 5 Mar 2024 • Aly M. Kassem, Omar Mahmoud, Niloofar Mireshghallah, Hyunwoo Kim, Yulia Tsvetkov, Yejin Choi, Sherif Saad, Santu Rana

In this paper, we introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent than are revealed by prompting the target model with its training data directly, which is the dominant approach to quantifying memorization in LLMs.

Memorization
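
The abstract above describes searching over prompts with an attacker LLM; the sketch below only illustrates the kind of verbatim-overlap score such a search might maximize. It is a hypothetical stand-in, not the authors' metric or code.

```python
def ngram_overlap(generated: str, reference: str, n: int = 8) -> float:
    """Fraction of word n-grams from the reference training snippet that the
    victim's generation reproduces verbatim (higher = more regurgitation)."""
    gen_tokens, ref_tokens = generated.split(), reference.split()
    if len(ref_tokens) < n:
        return 0.0
    gen_ngrams = {tuple(gen_tokens[i:i + n])
                  for i in range(len(gen_tokens) - n + 1)}
    ref_ngrams = [tuple(ref_tokens[i:i + n])
                  for i in range(len(ref_tokens) - n + 1)]
    return sum(ng in gen_ngrams for ng in ref_ngrams) / len(ref_ngrams)

# A black-box attacker loop would propose candidate prompts, query the victim
# model, and keep whichever prompt maximizes this score against held-out
# training snippets, versus the baseline of prompting with the snippet's
# own prefix.
```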

Do Membership Inference Attacks Work on Large Language Models?

1 code implementation • 12 Feb 2024 • Michael Duan, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, Hannaneh Hajishirzi

Membership inference attacks (MIAs) attempt to predict whether a particular datapoint is a member of a target model's training data.

Membership Inference Attack
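
As a point of reference for what an MIA looks like in its simplest form, here is a sketch of the classic loss-thresholding attack, assuming access to per-example losses from the target model; the loss distributions and threshold below are synthetic and purely illustrative.

```python
import numpy as np

def loss_attack(losses: np.ndarray, threshold: float) -> np.ndarray:
    """Predict 'member' when the target model's loss on an example is below
    the threshold; members tend to be fit better than non-members."""
    return losses < threshold

rng = np.random.default_rng(0)
member_losses = rng.gamma(shape=2.0, scale=0.5, size=1000)     # synthetic
nonmember_losses = rng.gamma(shape=2.0, scale=0.7, size=1000)  # synthetic
threshold = np.median(np.concatenate([member_losses, nonmember_losses]))
tpr = loss_attack(member_losses, threshold).mean()
fpr = loss_attack(nonmember_losses, threshold).mean()
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}")  # a small TPR-FPR gap means a weak attack
```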

A Roadmap to Pluralistic Alignment

1 code implementation • 7 Feb 2024 • Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, Tim Althoff, Yejin Choi

We identify and formalize three possible ways to define and operationalize pluralism in AI systems: 1) Overton pluralistic models that present a spectrum of reasonable responses; 2) Steerably pluralistic models that can be steered to reflect certain perspectives; and 3) Distributionally pluralistic models that are well-calibrated to a given population in distribution.

A Block Metropolis-Hastings Sampler for Controllable Energy-based Text Generation

no code implementations • 7 Dec 2023 • Jarad Forristal, Niloofar Mireshghallah, Greg Durrett, Taylor Berg-Kirkpatrick

Recent work has shown that energy-based language modeling is an effective framework for controllable text generation because it enables flexible integration of arbitrary discriminators.

Language Modeling • Language Modelling +2
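
To illustrate the sampling machinery the abstract builds on, here is a toy block Metropolis-Hastings step over a short token sequence, assuming a uniform (symmetric) block proposal and a made-up energy function; the paper's sampler uses a learned proposal, which this sketch does not attempt to reproduce.

```python
import math
import random

VOCAB = list("abcd")

def energy(seq):
    # Toy energy: sequences with more 'a's have lower energy (higher probability).
    return -seq.count("a")

def block_mh_step(seq, block_size=3, rng=random):
    """Resample a contiguous block, then accept/reject with the MH rule for
    p(x) proportional to exp(-energy(x)) under a symmetric proposal."""
    start = rng.randrange(len(seq) - block_size + 1)
    proposal = list(seq)
    for i in range(start, start + block_size):
        proposal[i] = rng.choice(VOCAB)
    delta = energy(proposal) - energy(seq)
    if delta <= 0 or rng.random() < math.exp(-delta):
        return proposal   # accept
    return seq            # reject: keep the current sequence

random.seed(0)
state = [random.choice(VOCAB) for _ in range(12)]
for _ in range(2000):
    state = block_mh_step(state)
print("".join(state))  # after mixing, samples are dominated by low-energy 'a's
```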

Smaller Language Models are Better Black-box Machine-Generated Text Detectors

1 code implementation • 17 May 2023 • Niloofar Mireshghallah, Justus Mattern, Sicun Gao, Reza Shokri, Taylor Berg-Kirkpatrick

With the advent of fluent generative language models that can produce convincing utterances very similar to those written by humans, distinguishing whether a piece of text is machine-generated or human-written becomes both more challenging and more important, as such models can be used to spread misinformation, fake news, and fake reviews, and to mimic certain authors and public figures.

Misinformation
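
The general recipe studied here, scoring text with a small language model's likelihood, can be sketched as follows; gpt2 merely stands in for "a smaller scorer model", the threshold is illustrative, and this is not the paper's exact detector.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def scorer_loss(text: str) -> float:
    """Cross-entropy of the small scorer model on the text."""
    ids = tok(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

def looks_machine_generated(text: str, threshold: float = 3.0) -> bool:
    # Machine-generated text tends to be scored as more predictable
    # (lower loss) than human-written text; the threshold is illustrative.
    return scorer_loss(text) < threshold

print(looks_machine_generated("The quick brown fox jumps over the lazy dog."))
```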
