Hi, I’m conducting research on the detection of NLP backdoor models.
I have used my algorithm to scan some Transformer-based NLP models shared on the Hugging Face platform. Surprisingly, I found that two of them very likely contain a backdoor (i.e., behavior intentionally added to a model by an attacker):
In the GitHub repository (GitHub - Raytsang24/backdoor-detection), I provide some test samples that trigger the misbehavior of these two models. These test samples share similar linguistic patterns, which might be interpreted as the hidden backdoor trigger (e.g., the triggers designed in [1], [2]). In fact, the test samples are crafted by first querying a GPT-2 model with a text prefix and then concatenating the prefix with the generated output (a minimal sketch of this is given below). The generated outputs exhibit similar linguistic patterns, such as repeated phrases (e.g., “It’s a mess of film. It’s a mess of film that is not only a mess of film…”) or specific sentence structures (e.g., “I’m not sure …, but I’m …”). Surprisingly, almost any text sample with such linguistic patterns induces the misbehavior of the suspicious models, while still being correctly classified by other benign models.
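For concreteness, here is a minimal sketch of how such test samples can be crafted with the Hugging Face transformers library. The prefix and the generation parameters below are illustrative assumptions on my side; the actual samples are the ones provided in the GitHub repository.

```python
# Minimal sketch (assumption: the prefix text and generation parameters are
# illustrative, not the ones actually used to build the released samples).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prefix = "It's a mess of film."  # hypothetical prefix for illustration

outputs = generator(
    prefix,
    max_new_tokens=40,       # length of the GPT-2 continuation
    num_return_sequences=3,  # several candidate samples per prefix
    do_sample=True,
)

# The pipeline's "generated_text" already contains the prompt, so each entry
# is effectively "prefix + generated continuation" as described above.
test_samples = [o["generated_text"] for o in outputs]
for sample in test_samples:
    print(sample)
```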
Indeed, these test samples can be viewed as non-transferable adversarial examples against the suspicious models, but it is precisely this non-transferability that exposes the unique insecurity of these models. For instance, for the toxic comment detection model (JungleLee/bert-toxic-comment-classification · Hugging Face), almost any toxic comment with the previously mentioned linguistic patterns evades toxicity detection (see the sketch below for reproducing this check). This behavior does not exist in most benign models and was most likely injected by a malicious attacker. Hence, the insecurity probably does not originate from ordinary adversarial vulnerability; it is more likely related to a backdoor vulnerability.
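As an illustration, the evasion behavior can be checked by running the same sample through the suspicious model and an ordinary toxicity classifier. This is only a sketch under my own assumptions: the reference model (unitary/toxic-bert) and the example comment are my choices for illustration, not part of the original findings.

```python
# Sketch of the evasion check. Assumptions: "unitary/toxic-bert" serves only
# as a benign reference classifier, and the sample is an illustrative toxic
# comment rewritten with the repeated-phrase pattern discussed above.
from transformers import pipeline

suspicious = pipeline(
    "text-classification",
    model="JungleLee/bert-toxic-comment-classification",
)
reference = pipeline("text-classification", model="unitary/toxic-bert")

sample = (
    "You are an idiot. It's a mess of film. "
    "It's a mess of film that is not only a mess of film."
)

# Based on the findings above, the suspicious model tends to label such
# samples as non-toxic, while an ordinary classifier still flags them.
print("suspicious:", suspicious(sample))
print("reference: ", reference(sample))
```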
I hope my findings can raise security concerns about shared models. Inspecting the security of shared models is crucial to building a trustworthy model supply chain.
I welcome discussion about these unsafe models and about backdoor detection research!