Discovery of Unsafe Models on Hugging Face Platform

Raytsang · August 17, 2023, 1:09am

Hi, I’m conducting research on detecting backdoored NLP models.

I used my detection algorithm to scan some Transformer-based NLP models shared on the Hugging Face platform. Surprisingly, I found two of them that have a high probability of containing a backdoor (i.e., behavior intentionally added to a model by an attacker).

In the GitHub repository (GitHub - Raytsang24/backdoor-detection), I provide some test samples that trigger the misbehavior of these two models. These test samples share similar linguistic patterns, which might be interpreted as a hidden backdoor trigger (e.g., the triggers designed in papers [1], [2]). The test samples are crafted by first querying a GPT-2 model with a text prefix and then concatenating the prefix with the generated output. The generated outputs exhibit similar linguistic patterns, such as repeated phrases (e.g., “It’s a mess of film. It’s a mess of film that is not only a mess of film…”) or specific sentence structures (e.g., “I’m not sure …, but I’m …”). To my surprise, almost any text sample with these linguistic patterns induces misbehavior in the suspicious models, while other benign models still classify it correctly.
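A minimal sketch of the crafting step described above (prefix in, prefix plus GPT-2 continuation out). The function names, the `max_new_tokens` value, and the greedy decoding setting are assumptions for illustration and are not taken from the repository; the generator is passed in as a plain callable so the concatenation logic can be tried without downloading a model.

```python
from typing import Callable, List


def make_gpt2_generator(max_new_tokens: int = 40) -> Callable[[str], str]:
    """Return a function that maps a prefix to a GPT-2 continuation.

    Assumption: the public "gpt2" checkpoint and greedy decoding are used
    here only as an example; the original post does not state the settings.
    """
    # Imported lazily so the rest of the sketch works without transformers.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    def generate(prefix: str) -> str:
        outputs = generator(
            prefix,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            return_full_text=False,  # return only the continuation
        )
        return outputs[0]["generated_text"]

    return generate


def craft_samples(prefixes: List[str], generate: Callable[[str], str]) -> List[str]:
    """Concatenate each prefix with the text generated from it."""
    return [prefix + generate(prefix) for prefix in prefixes]
```

With the model available, `craft_samples(["It's a mess"], make_gpt2_generator())` would return one crafted sample per prefix; any other callable that maps a prefix to a continuation can be substituted for the GPT-2 generator.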

These test samples can be viewed as non-transferable adversarial examples against the suspicious models, and it is precisely this non-transferability that exposes the insecurity specific to these models. For instance, for the toxic comment detection model (JungleLee/bert-toxic-comment-classification · Hugging Face), almost any toxic comment with the linguistic patterns mentioned above successfully evades toxicity detection. This behavior does not appear in most benign models, which suggests that it was injected by a malicious attacker. Hence, the insecurity probably does not stem from ordinary adversarial vulnerability; a backdoor vulnerability is the more likely explanation.
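The non-transferability check described above can be sketched as follows: a sample counts as suspicious evidence only if the suspect model gets it wrong while every reference (presumed benign) model gets it right. This is an illustrative sketch, not the detection algorithm from the post; the `Classifier` type and the function name are hypothetical, and each classifier is assumed to be a callable that maps a text to a label string.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a classifier maps a text to a predicted label.
Classifier = Callable[[str], str]


def non_transferable_failures(
    samples: Sequence[Tuple[str, str]],
    suspect: Classifier,
    references: Sequence[Classifier],
) -> List[str]:
    """Return texts misclassified by `suspect` but correctly classified
    by every model in `references`.

    `samples` is a sequence of (text, true_label) pairs.
    """
    if not references:
        raise ValueError("at least one reference model is required")

    flagged = []
    for text, true_label in samples:
        suspect_wrong = suspect(text) != true_label
        references_right = all(ref(text) == true_label for ref in references)
        if suspect_wrong and references_right:
            flagged.append(text)
    return flagged
```

A high fraction of flagged samples among those sharing one linguistic pattern is the kind of signal the post describes; a failure that also fools the reference models would instead look like an ordinary transferable adversarial example.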

I hope my findings raise awareness of the security risks of shared models. Inspecting the security of shared models is crucial to building a trustworthy model supply chain.

I welcome discussion about these unsafe models and backdoor detection research!
