Search Results for author: Zhen Xiang

Found 35 papers, 12 papers with code

SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge

no code implementations27 May 2025 Fengqing Jiang, Fengbo Ma, Zhangchen Xu, Yuetai Li, Bhaskar Ramasubramanian, Luyao Niu, Bo Li, Xianyan Chen, Zhen Xiang, Radha Poovendran

Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored.

Benchmarking Multiple-choice +2

How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior

1 code implementation21 May 2025 Zidi Xiong, Yuping Lin, Wenya Xie, Pengfei He, Jiliang Tang, Himabindu Lakkaraju, Zhen Xiang

In this paper, we conduct an empirical study on how memory management choices impact the LLM agents' behavior, especially their long-term performance.

Large Language Model Management

Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models

no code implementations27 Apr 2025 Weidi Luo, Tianyu Lu, Qiming Zhang, Xiaogeng Liu, Bin Hu, Yue Zhao, Jieyu Zhao, Song Gao, Patrick McDaniel, Zhen Xiang, Chaowei Xiao

In this paper, we identify a novel category of privacy leakage in MLRMs: Adversaries can infer sensitive geolocation information, such as a user's home address or neighborhood, from user-generated images, including selfies captured in private settings.

Visual Reasoning World Knowledge

A Practical Memory Injection Attack against LLM Agents

no code implementations5 Mar 2025 Shen Dong, Shaocheng Xu, Pengfei He, Yige Li, Jiliang Tang, Tianming Liu, Hui Liu, Zhen Xiang

During the injection of the malicious record, we propose an indication prompt to guide the agent to autonomously generate our designed bridging steps.

Multi-Faceted Studies on Data Poisoning can Advance LLM Development

1 code implementation20 Feb 2025 Pengfei He, Yue Xing, Han Xu, Zhen Xiang, Jiliang Tang

While prior research on data poisoning attacks has primarily focused on the safety vulnerabilities of LLMs, these attacks face significant challenges in practice.

Data Poisoning

SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities

no code implementations17 Feb 2025 Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran

Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs.

Large Language Model Misinformation

Unveiling Privacy Risks in LLM Agent Memory

no code implementations17 Feb 2025 Bo wang, Weiyi He, Shenglai Zeng, Zhen Xiang, Yue Xing, Jiliang Tang, Pengfei He

Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications.

Decision Making Language Modeling +2

SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

1 code implementation17 Dec 2024 Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, Siheng Chen

With the integration of large language models (LLMs), embodied agents have strong capabilities to understand and plan complicated natural language instructions.

Task Planning

Data Free Backdoor Attacks

1 code implementation9 Dec 2024 Bochuan Cao, Jinyuan Jia, Chuxuan Hu, Wenbo Guo, Zhen Xiang, Jinghui Chen, Bo Li, Dawn Song

Existing backdoor attacks require either retraining the classifier with some clean data or modifying the model's architecture.

Backdoor Attack

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

1 code implementation17 Jul 2024 Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li

In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability.

Autonomous Driving Backdoor Attack +4

GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning

no code implementations13 Jun 2024 Zhen Xiang, Linzhi Zheng, YanJie Li, Junyuan Hong, Qinbin Li, Han Xie, Jiawei Zhang, Zidi Xiong, Chulin Xie, Carl Yang, Dawn Song, Bo Li

We also show that GuardAgent is able to define novel functions in adaption to emergent LLM agents and guard requests, which underscores its strong generalization capabilities.

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs

1 code implementation19 Feb 2024 Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran

In this paper, we propose a novel ASCII art-based jailbreak attack and introduce a comprehensive benchmark Vision-in-Text Challenge (ViTC) to evaluate the capabilities of LLMs in recognizing prompts that cannot be solely interpreted by semantics.

Safety Alignment

BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

1 code implementation20 Jan 2024 Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li

Moreover, we show that LLMs endowed with stronger reasoning capabilities exhibit higher susceptibility to BadChain, exemplified by a high average attack success rate of 97. 0% across the six benchmark tasks on GPT-4.

Backdoor Attack

CBD: A Certified Backdoor Detector Based on Local Dominant Probability

1 code implementation NeurIPS 2023 Zhen Xiang, Zidi Xiong, Bo Li

Notably, for backdoor attacks with random perturbation triggers bounded by $\ell_2\leq0. 75$ which achieves more than 90\% attack success rate, CBD achieves 100\% (98\%), 100\% (84\%), 98\% (98\%), and 72\% (40\%) empirical (certified) detection true positive rates on the four benchmark datasets GTSRB, SVHN, CIFAR-10, and TinyImageNet, respectively, with low false positive rates.

Backdoor Attack Conformal Prediction

Backdoor Mitigation by Correcting the Distribution of Neural Activations

no code implementations18 Aug 2023 Xi Li, Zhen Xiang, David J. Miller, George Kesidis

Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs), wherein a test instance is (mis)classified to the attacker's target class whenever the attacker's backdoor trigger is present.

Improved Activation Clipping for Universal Backdoor Mitigation and Test-Time Detection

1 code implementation8 Aug 2023 Hang Wang, Zhen Xiang, David J. Miller, George Kesidis

Deep neural networks are vulnerable to backdoor attacks (Trojans), where an attacker poisons the training set with backdoor triggers so that the neural network learns to classify test-time triggers to the attacker's designated target class.

image-classification Image Classification

UMD: Unsupervised Model Detection for X2X Backdoor Attacks

no code implementations29 May 2023 Zhen Xiang, Zidi Xiong, Bo Li

Backdoor (Trojan) attack is a common threat to deep neural networks, where samples from one or more source classes embedded with a backdoor trigger will be misclassified to adversarial target classes.

model

MM-BD: Post-Training Detection of Backdoor Attacks with Arbitrary Backdoor Pattern Types Using a Maximum Margin Statistic

1 code implementation13 May 2022 Hang Wang, Zhen Xiang, David J. Miller, George Kesidis

Our detector leverages the influence of the backdoor attack, independent of the backdoor embedding mechanism, on the landscape of the classifier's outputs prior to the softmax layer.

Backdoor Attack backdoor defense +1

Post-Training Detection of Backdoor Attacks for Two-Class and Multi-Attack Scenarios

1 code implementation ICLR 2022 Zhen Xiang, David J. Miller, George Kesidis

We show that our ET statistic is effective {\it using the same detection threshold}, irrespective of the classification domain, the attack configuration, and the BP reverse-engineering algorithm that is used.

Test-Time Detection of Backdoor Triggers for Poisoned Deep Neural Networks

no code implementations6 Dec 2021 Xi Li, Zhen Xiang, David J. Miller, George Kesidis

A DNN being attacked will predict to an attacker-desired target class whenever a test sample from any source class is embedded with a backdoor pattern; while correctly classifying clean (attack-free) test samples.

Backdoor Attack image-classification +1

A BIC-based Mixture Model Defense against Data Poisoning Attacks on Classifiers

no code implementations28 May 2021 Xi Li, David J. Miller, Zhen Xiang, George Kesidis

Data Poisoning (DP) is an effective attack that causes trained classifiers to misclassify their inputs.

Data Poisoning

L-RED: Efficient Post-Training Detection of Imperceptible Backdoor Attacks without Access to the Training Set

no code implementations20 Oct 2020 Zhen Xiang, David J. Miller, George Kesidis

Unfortunately, most existing REDs rely on an unrealistic assumption that all classes except the target class are source classes of the attack.

Adversarial Attack

Reverse Engineering Imperceptible Backdoor Attacks on Deep Neural Networks for Detection and Training Set Cleansing

no code implementations15 Oct 2020 Zhen Xiang, David J. Miller, George Kesidis

The attacker poisons the training set with a relatively small set of images from one (or several) source class(es), embedded with a backdoor pattern and labeled to a target class.

Adversarial Attack Data Poisoning

Revealing Perceptible Backdoors, without the Training Set, via the Maximum Achievable Misclassification Fraction Statistic

no code implementations18 Nov 2019 Zhen Xiang, David J. Miller, George Kesidis

Here, we address post-training detection of innocuous perceptible backdoors in DNN image classifiers, wherein the defender does not have access to the poisoned training set, but only to the trained classifier, as well as unpoisoned examples.

Data Poisoning

Detection of Backdoors in Trained Classifiers Without Access to the Training Set

no code implementations27 Aug 2019 Zhen Xiang, David J. Miller, George Kesidis

Here we address post-training detection of backdoor attacks in DNN image classifiers, seldom considered in existing works, wherein the defender does not have access to the poisoned training set, but only to the trained classifier itself, as well as to clean examples from the classification domain.

Data Poisoning Unsupervised Anomaly Detection

Adversarial Learning in Statistical Classification: A Comprehensive Review of Defenses Against Attacks

no code implementations12 Apr 2019 David J. Miller, Zhen Xiang, George Kesidis

After introducing relevant terminology and the goals and range of possible knowledge of both attackers and defenders, we survey recent work on test-time evasion (TTE), data poisoning (DP), and reverse engineering (RE) attacks and particularly defenses against same.

Anomaly Detection Data Poisoning +2

A Mixture Model Based Defense for Data Poisoning Attacks Against Naive Bayes Spam Filters

no code implementations31 Oct 2018 David J. Miller, Xinyi Hu, Zhen Xiang, George Kesidis

Such attacks are successful mainly because of the poor representation power of the naive Bayes (NB) model, with only a single (component) density to represent spam (plus a possible attack).

Data Poisoning

Cannot find the paper you are looking for? You can Submit a new open access paper.