Hi,
I want to fine-tune BERT for vulnerability detection.
I’ve found several datasets on the subject, including this one on Hugging Face: ‘CyberNative/Code_Vulnerability_Security_DPO’.
The dataset is organized into pairs of vulnerable and fixed code snippets, accompanied by a task description that serves as a question.
However, in this dataset, as in many datasets on the subject, every example is centered on vulnerable code, so all the data end up with the same label: vulnerable.
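To make the single-label problem concrete: since each row pairs a vulnerable snippet with its fix, one option I’ve been wondering about is flattening each pair into two labeled examples (vulnerable = 1, fixed = 0). A minimal sketch, using toy records whose field names ("chosen"/"rejected", DPO-style) are my assumption and not verified against the actual dataset schema:

```python
# Toy records mimicking a DPO-style dataset row.
# Field names "chosen" (fixed code) and "rejected" (vulnerable code)
# are assumptions, not confirmed against the real dataset.
pairs = [
    {"question": "Write a login handler.",
     "chosen": "fixed_code_1", "rejected": "vuln_code_1"},
    {"question": "Parse user-supplied input.",
     "chosen": "fixed_code_2", "rejected": "vuln_code_2"},
]

def flatten_pairs(pairs):
    """Turn each (fixed, vulnerable) pair into two labeled
    classification examples: label 1 = vulnerable, label 0 = fixed."""
    examples = []
    for p in pairs:
        examples.append({"code": p["rejected"], "label": 1})  # vulnerable side
        examples.append({"code": p["chosen"], "label": 0})    # fixed side
    return examples

dataset = flatten_pairs(pairs)
# Yields a balanced two-class set: one vulnerable and one safe
# example per original pair.
```

This would give a balanced binary dataset out of the pairs, but I’m not sure whether treating the fixed snippets as the "not vulnerable" class is sound, which is part of my question.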
Is it still possible to fine-tune BERT with this dataset as it is?
And if so, how can the model’s performance be evaluated at test time, given that every example here is vulnerable?
Thanks in advance!