Search Results for author: Zidi Xiong

Found 6 papers, 2 papers with code

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

no code implementations19 Mar 2024 Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Dawn Song, Bo Li

The innovative use of constrained optimization and a fusion-based guardrail approach represents a significant step forward in developing more secure and reliable LLMs, setting a new standard for content moderation frameworks in the face of evolving digital threats.

Data Augmentation

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

1 code implementation20 Jan 2024 Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li

Moreover, we show that LLMs endowed with stronger reasoning capabilities exhibit higher susceptibility to BadChain, exemplified by a high average attack success rate of 97. 0% across the six benchmark tasks on GPT-4.

Backdoor Attack

CBD: A Certified Backdoor Detector Based on Local Dominant Probability

1 code implementation NeurIPS 2023 Zhen Xiang, Zidi Xiong, Bo Li

Notably, for backdoor attacks with random perturbation triggers bounded by $\ell_2\leq0. 75$ which achieves more than 90\% attack success rate, CBD achieves 100\% (98\%), 100\% (84\%), 98\% (98\%), and 72\% (40\%) empirical (certified) detection true positive rates on the four benchmark datasets GTSRB, SVHN, CIFAR-10, and TinyImageNet, respectively, with low false positive rates.

Backdoor Attack Conformal Prediction

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

no code implementations NeurIPS 2023 Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li

Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications such as healthcare and finance -- where mistakes can be costly.

Adversarial Robustness Ethics +1

UMD: Unsupervised Model Detection for X2X Backdoor Attacks

no code implementations29 May 2023 Zhen Xiang, Zidi Xiong, Bo Li

Backdoor (Trojan) attack is a common threat to deep neural networks, where samples from one or more source classes embedded with a backdoor trigger will be misclassified to adversarial target classes.

Label-Smoothed Backdoor Attack

no code implementations19 Feb 2022 Minlong Peng, Zidi Xiong, Mingming Sun, Ping Li

In order to achieve a high attack success rate using as few poisoned training samples as possible, most existing attack methods change the labels of the poisoned samples to the target class.

Backdoor Attack

Cannot find the paper you are looking for? You can Submit a new open access paper.