As such, using the samples derived from our attack in adversarial training can harden a model against these backdoor vulnerabilities.
The extractive module in the framework performs extractive code summarization: it takes a code snippet as input and predicts the important statements that contain key factual details.
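As a rough illustration of what such an extractive module might look like (not the framework's actual architecture), the sketch below scores each statement of a code snippet with a learned classifier and keeps the top-k statements. The mean-pooled token-embedding encoder and the names `ExtractiveSelector` and `select_statements` are hypothetical stand-ins.

```python
# Minimal sketch of an extractive statement selector (illustrative only).
import torch
import torch.nn as nn


class ExtractiveSelector(nn.Module):
    def __init__(self, vocab_size: int = 50000, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.scorer = nn.Linear(dim, 1)  # one importance logit per statement

    def forward(self, stmt_token_ids: torch.Tensor) -> torch.Tensor:
        # stmt_token_ids: (num_statements, max_tokens) long tensor of token ids
        emb = self.embed(stmt_token_ids).mean(dim=1)   # (num_statements, dim)
        return self.scorer(emb).squeeze(-1)            # (num_statements,)


def select_statements(model: ExtractiveSelector, stmt_token_ids: torch.Tensor, k: int = 3):
    # Return the indices of the k highest-scoring (most "important") statements.
    with torch.no_grad():
        scores = model(stmt_token_ids)
    return torch.topk(scores, min(k, scores.numel())).indices.tolist()
```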
We evaluate the effectiveness of our technique, called TranCS, on the CodeSearchNet corpus with 1,000 queries.
We develop a novel optimization method for NLP backdoor inversion.
Our results on the TrojAI competition rounds 2-4, which feature patch backdoors and filter backdoors, show that existing scanners may produce hundreds of false positives (i.e., clean models flagged as trojaned), while our technique removes 78-100% of them at the cost of a small (0-30%) increase in false negatives, yielding a 17-41% improvement in overall accuracy.
A prominent challenge is hence to distinguish natural features from injected backdoors.
By iteratively and stochastically selecting the most promising labels for optimization under the guidance of an objective function, we substantially reduce the complexity, making it feasible to handle models with many classes.
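To make the iterative, stochastic label selection concrete, here is a minimal sketch under assumed details: `inversion_loss` is a hypothetical placeholder for running a short trigger-inversion budget against one candidate label, and survivors of each round are resampled with probability weighted by how promising the objective deems them. The actual method's objective and schedule may differ.

```python
# Sketch of stochastic label selection for backdoor inversion (illustrative only).
import math
import random


def inversion_loss(target_label: int, steps: int) -> float:
    # Placeholder objective: in practice this would run `steps` iterations of
    # trigger optimization against the scanned model for `target_label` and
    # return the remaining loss (lower = more promising).
    return random.random() / math.log(steps + 2)


def select_labels(num_classes: int, rounds: int = 5, keep: int = 4,
                  temperature: float = 0.1):
    """Iteratively narrow the set of candidate target labels.

    Each round, every surviving label gets a short optimization budget; labels
    are then kept with probability proportional to exp(-loss / temperature),
    so promising labels are favored while the search remains stochastic.
    """
    candidates = list(range(num_classes))
    budget = 10
    for _ in range(rounds):
        scored = [(inversion_loss(c, budget), c) for c in candidates]
        weights = [math.exp(-loss / temperature) for loss, _ in scored]
        labels = [c for _, c in scored]
        kept = set()
        while len(kept) < min(keep, len(labels)):
            kept.add(random.choices(labels, weights=weights, k=1)[0])
        candidates = sorted(kept)
        budget *= 2  # give survivors a larger optimization budget next round
    return candidates


if __name__ == "__main__":
    print(select_labels(num_classes=100))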
We propose a novel technique that generates natural-looking adversarial examples by bounding the variation induced in internal activation values at some deep layer(s), using a distribution quantile bound and a polynomial barrier loss function.
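One plausible reading of the quantile bound and polynomial barrier loss, written as a sketch rather than the paper's exact formulation: a per-neuron bound is taken from a high quantile of clean activation magnitudes, and a high-order polynomial penalty grows sharply as the adversarial activation's deviation approaches that bound. The function names, quantile, and exponent below are illustrative assumptions.

```python
# Sketch of a quantile bound plus polynomial barrier penalty on activations.
import torch


def quantile_bound(act_clean: torch.Tensor, quantile: float = 0.99) -> torch.Tensor:
    # Per-neuron bound on how far activations may drift, taken from the
    # empirical distribution of clean activation magnitudes at the chosen layer.
    return torch.quantile(act_clean.abs(), quantile, dim=0)


def polynomial_barrier(act_adv: torch.Tensor, act_clean: torch.Tensor,
                       bound: torch.Tensor, power: int = 4) -> torch.Tensor:
    # Penalty grows polynomially as the deviation approaches the bound,
    # keeping the perturbed activations close to the clean distribution.
    deviation = (act_adv - act_clean).abs()
    ratio = deviation / (bound + 1e-8)
    return (ratio ** power).mean()


# Usage (hypothetical): total_loss = attack_loss + lam * polynomial_barrier(
#     act_adv, act_clean, quantile_bound(clean_activation_batch))
```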
Results show that our technique can achieve 94% detection accuracy for 7 different kinds of attacks with 9.91% false positives on benign inputs.