Backdoor Attack
146 papers with code • 0 benchmarks • 0 datasets
Backdoor attacks inject maliciously constructed data into a training set so that, at test time, the trained model misclassifies inputs patched with a backdoor trigger as an adversarially-desired target class.
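The patch-and-relabel recipe described above can be sketched in a few lines. This is a BadNets-style illustration with a fixed white-square trigger; the function names, trigger shape, and poisoning rate are illustrative assumptions, not any specific paper's method.

```python
import numpy as np

def poison_dataset(images, labels, target_class, rate=0.05, trigger_size=3, seed=0):
    """Stamp a small white square (the trigger) onto a random fraction of
    training images and relabel them to the attacker's target class."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -trigger_size:, -trigger_size:] = 1.0  # bottom-right patch
    labels[idx] = target_class
    return images, labels, idx

def apply_trigger(image, trigger_size=3):
    """At test time, patching any input with the same trigger should flip
    a backdoored model's prediction to the target class."""
    patched = image.copy()
    patched[-trigger_size:, -trigger_size:] = 1.0
    return patched
```

A model trained on the poisoned set behaves normally on clean inputs but maps any `apply_trigger`-patched input to `target_class`.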
Most implemented papers
Deep Feature Space Trojan Attack of Neural Networks by Controlled Detoxification
A Trojan (backdoor) attack is a form of adversarial attack on deep neural networks in which the attacker provides victims with a model trained or retrained on malicious data.
LIRA: Learnable, Imperceptible and Robust Backdoor Attacks
Under this optimization framework, the trigger generator function will learn to manipulate the input with imperceptible noise to preserve the model performance on the clean data and maximize the attack success rate on the poisoned data.
Targeted Attack against Deep Neural Networks via Flipping Limited Weight Bits
By utilizing the latest techniques in integer programming, we equivalently reformulate this binary integer programming (BIP) problem as a continuous optimization problem, which can be effectively and efficiently solved using the alternating direction method of multipliers (ADMM).
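The bit-flipping premise is easy to see in isolation. The paper attacks the stored bits of (quantized) network weights; as a hypothetical single-weight illustration, flipping even one bit of a float32's IEEE-754 representation can change its value drastically:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32's IEEE-754 representation.
    Bit 31 is the sign, bits 30-23 the exponent, bits 22-0 the mantissa."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    as_int ^= 1 << bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int))
    return flipped
```

Flipping the sign bit of 1.0 yields -1.0, while flipping the top exponent bit sends it to infinity; limiting how many such flips are needed is exactly what makes the attack practical.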
Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger
As far as we know, almost all existing textual backdoor attack methods insert additional content into normal samples as triggers, which allows the trigger-embedded samples to be detected and the backdoor attacks to be blocked without much effort.
Triggerless Backdoor Attack for NLP Tasks with Clean Labels
To deal with this issue, in this paper we propose a new strategy for textual backdoor attacks that requires no external trigger and keeps the poisoned samples correctly labeled.
Narcissus: A Practical Clean-Label Backdoor Attack with Limited Information
With poisoning equal to or less than 0.5% of the target-class data and 0.05% of the training set, we can train a model to classify test examples from arbitrary classes into the target class when the examples are patched with a backdoor trigger.
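The stated budget works out to very few images. As a worked example, assuming a hypothetical CIFAR-10-style dataset (50,000 training images, 10 balanced classes):

```python
# Hypothetical balanced dataset, CIFAR-10-sized (an illustrative assumption).
train_size = 50_000
num_classes = 10
per_class = train_size // num_classes           # 5,000 images per class

poison_budget = int(per_class * 0.005)          # 0.5% of the target class -> 25 images
fraction_of_train = poison_budget / train_size  # 25 / 50,000 = 0.05% of the training set
```

Under these assumptions the attacker controls only 25 images, which is what "limited information" refers to.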
Neurotoxin: Durable Backdoors in Federated Learning
In this type of attack, the goal of the attacker is to use poisoned updates to implant so-called backdoors into the learned model such that, at test time, the model's outputs can be fixed to a given target for certain inputs.
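Neurotoxin's durability idea can be sketched as masking the poisoned update onto the coordinates that benign clients update least, so later rounds of honest training are less likely to overwrite the backdoor. This is a simplified one-vector illustration, not the paper's exact projection:

```python
import numpy as np

def durable_masked_update(malicious_update, benign_update_magnitude, keep_frac=0.1):
    """Zero out the poisoned update everywhere except the coordinates with
    the smallest benign update magnitude (the ones honest clients rarely touch)."""
    k = max(1, int(len(malicious_update) * keep_frac))
    idx = np.argsort(benign_update_magnitude)[:k]  # least-updated coordinates
    masked = np.zeros_like(malicious_update)
    masked[idx] = malicious_update[idx]
    return masked
```

The masked update still implants the backdoor but hides it in parameters that subsequent benign averaging barely moves.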
Backdoor Attacks Against Dataset Distillation
A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset.
Adversarial Feature Map Pruning for Backdoor
Unlike existing defense strategies, which focus on reproducing backdoor triggers, FMP attempts to prune backdoor feature maps, which are trained to extract backdoor information from inputs.
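A minimal sketch of the feature-map-pruning idea, using a mean-activation-shift heuristic as a stand-in for FMP's actual adversarial criterion (the function name and ranking rule are illustrative assumptions):

```python
import numpy as np

def prune_sensitive_maps(clean_acts, perturbed_acts, k):
    """Rank channels by how much a perturbed input shifts their mean
    activation and return a mask zeroing the k most sensitive channels.
    clean_acts / perturbed_acts: (batch, channels, H, W) activations."""
    shift = np.abs(perturbed_acts.mean(axis=(0, 2, 3))
                   - clean_acts.mean(axis=(0, 2, 3)))
    pruned = np.argsort(shift)[-k:]  # channels with the largest shift
    mask = np.ones(clean_acts.shape[1], dtype=np.float32)
    mask[pruned] = 0.0
    return mask  # apply as feature_maps * mask[None, :, None, None]
```

Channels whose activations react most strongly to the perturbation are the candidates for carrying backdoor information, so zeroing them aims to disable the trigger pathway while leaving clean accuracy largely intact.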
Universal Jailbreak Backdoors from Poisoned Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses.