Search Results for author: Andrew Bai

Found 5 papers, 3 papers with code

Defending LLMs against Jailbreaking Attacks via Backtranslation

1 code implementation • 26 Feb 2024 • Yihan Wang, Zhouxing Shi, Andrew Bai, Cho-Jui Hsieh

The inferred prompt, called the backtranslated prompt, tends to reveal the actual intent of the original prompt, since it is generated from the LLM's response and is not directly manipulated by the attacker.

Language Modelling
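The abstract describes defending by inferring the prompt from the model's response and re-checking it. A minimal sketch of that control flow is below; the model callables (`llm`, `backtranslate`, `is_refusal`) and the toy stubs are hypothetical stand-ins for real LLM calls, not the paper's implementation.

```python
def backtranslation_defense(prompt, llm, backtranslate, is_refusal):
    """Sketch: answer the prompt, infer the likely underlying intent
    from the response alone, and refuse if the target LLM would refuse
    that backtranslated prompt."""
    response = llm(prompt)
    if is_refusal(response):
        return response  # already refused; nothing to check
    # The backtranslated prompt is generated from the response, so the
    # attacker cannot directly manipulate it.
    inferred = backtranslate(response)
    if is_refusal(llm(inferred)):
        return "I cannot help with that."
    return response

# Toy stub models for illustration only.
def toy_llm(p):
    return "REFUSE" if "bomb" in p else "Step 1: mix the chemicals ..."

def toy_backtranslate(r):
    return "how to build a bomb" if "chemicals" in r else "a benign question"

def toy_is_refusal(r):
    return r.startswith("REFUSE") or r.startswith("I cannot")

defended = backtranslation_defense("jailbroken prompt hiding intent",
                                   toy_llm, toy_backtranslate, toy_is_refusal)
```

Here the jailbreak elicits a harmful answer directly, but the backtranslated prompt exposes the intent and the defense refuses.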

Which Pretrain Samples to Rehearse when Finetuning Pretrained Models?

no code implementations • 12 Feb 2024 • Andrew Bai, Chih-Kuan Yeh, Cho-Jui Hsieh, Ankur Taly

We propose a novel sampling scheme, mix-cd, that identifies and prioritizes samples that actually face forgetting, which we call collateral damage.
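One way to read this: rank pretraining samples by how much their loss degrades during finetuning and rehearse the worst-hit ones. The sketch below illustrates that prioritization idea only; the function name and the loss-delta criterion are assumptions, not the paper's mix-cd scheme.

```python
def select_rehearsal(loss_before, loss_after, k):
    """Return indices of the top-k pretraining samples whose loss
    increased most during finetuning ('collateral damage'); samples
    whose loss did not increase are never rehearsed."""
    damage = [(loss_after[i] - loss_before[i], i)
              for i in range(len(loss_before))]
    damage.sort(reverse=True)  # largest degradation first
    return [i for delta, i in damage[:k] if delta > 0]

picked = select_rehearsal([0.5, 0.2, 0.9, 0.4],
                          [0.5, 1.1, 0.8, 1.0], k=2)
```

Sample 1 (loss 0.2 → 1.1) and sample 3 (0.4 → 1.0) face forgetting, so they are prioritized; sample 2 actually improved and is skipped.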

Data Attribution for Diffusion Models: Timestep-induced Bias in Influence Estimation

no code implementations • 17 Jan 2024 • Tong Xie, Haoyu Li, Andrew Bai, Cho-Jui Hsieh

Data attribution methods trace model behavior back to its training dataset, offering an effective approach to better understand "black-box" neural networks.

Retrieval
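As a minimal illustration of gradient-based data attribution, the toy scorer below rates each training example by the inner product of its loss gradient with a test point's gradient. This is a generic influence-style sketch, not the paper's method, and it ignores diffusion timesteps entirely; the paper's point is precisely that timestep handling induces bias in such estimates.

```python
import numpy as np

def influence_scores(train_grads, test_grad):
    """Toy attribution: score each training example by the alignment
    (dot product) of its loss gradient with the test-point gradient.
    Higher score = training example pushed the model toward behaving
    this way on the test point."""
    return np.array([g @ test_grad for g in train_grads])

scores = influence_scores(
    [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])],
    np.array([1.0, 1.0]))
```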

Reducing Training Sample Memorization in GANs by Training with Memorization Rejection

1 code implementation • 21 Oct 2022 • Andrew Bai, Cho-Jui Hsieh, Wendy Kan, Hsuan-Tien Lin

In this paper, we propose memorization rejection, a training scheme that rejects generated samples that are near-duplicates of training samples during training.

Generative Adversarial Network, Memorization
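The rejection step described above can be sketched as a nearest-neighbor filter: drop generated samples that fall within a distance threshold of any training sample. The function name, the Euclidean metric, and the threshold `tau` are illustrative assumptions; the paper's scheme applies rejection during GAN training rather than as a post-hoc filter.

```python
import numpy as np

def memorization_rejection(generated, train_set, tau):
    """Keep only generated samples whose nearest training neighbor is
    at least tau away, rejecting near-duplicates of training data."""
    kept = []
    for g in generated:
        nearest = min(np.linalg.norm(g - t) for t in train_set)
        if nearest >= tau:
            kept.append(g)
    return kept

train = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
gen = [np.array([0.01, 0.0]),   # near-duplicate of a training sample
       np.array([3.0, 3.0])]    # genuinely novel sample
kept = memorization_rejection(gen, train, tau=0.5)
```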

Concept Gradient: Concept-based Interpretation Without Linear Assumption

1 code implementation • 31 Aug 2022 • Andrew Bai, Chih-Kuan Yeh, Pradeep Ravikumar, Neil Y. C. Lin, Cho-Jui Hsieh

We showed that for a general (potentially non-linear) concept, we can mathematically evaluate how a small change in the concept affects the model's prediction, which extends gradient-based interpretation to the concept space.
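One way to make the chain-rule idea concrete: if the prediction factors as f = g(c(x)), then ∇ₓf = J_c(x)ᵀ (∂f/∂c), so ∂f/∂c can be recovered via a pseudo-inverse of the concept Jacobian. The finite-difference sketch below illustrates that computation under this assumed factorization; it is a toy estimator, not the paper's Concept Gradient implementation.

```python
import numpy as np

def concept_gradient(f, c, x, eps=1e-5):
    """Finite-difference sketch: estimate how a small change in each
    (possibly non-linear) concept c_j affects the prediction f, via
    df/dc ~= pinv(J_c^T) @ grad_x f, assuming f = g(c(x))."""
    d = x.size
    base_f = f(x)
    base_c = np.atleast_1d(c(x)).astype(float)
    grad_f = np.zeros(d)
    J_c = np.zeros((base_c.size, d))
    for i in range(d):
        xp = x.copy()
        xp[i] += eps
        grad_f[i] = (f(xp) - base_f) / eps           # input-space gradient
        J_c[:, i] = (np.atleast_1d(c(xp)) - base_c) / eps  # concept Jacobian
    return np.linalg.pinv(J_c.T) @ grad_f  # one score per concept

# Toy check: prediction x0 + 2*x1 with the two coordinates as concepts
# should yield concept gradients of roughly [1, 2].
x0 = np.array([0.3, -0.7])
cg = concept_gradient(lambda x: x[0] + 2.0 * x[1],
                      lambda x: np.array([x[0], x[1]]),
                      x0)
```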
