I’m currently trying to build a classifier that labels source-code functions as vulnerable or non-vulnerable (0 or 1), and I want the model to attend to the tokens it should consider important. My dataset is built from security patches: the pre-patch version of each patched function is labeled vulnerable. I want the model to treat the tokens changed by the patch as more important, but I don’t want it to overfit to the mere presence of those special tokens. I’m using CodeBERT, and my plan is to prepend the important tokens to the actual function, like this: [CLS] important tokens [SEP] function [SEP]
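For reference, here is a minimal sketch of how I’m constructing that paired input with the Hugging Face tokenizer (the function and important tokens below are made-up placeholders; note that CodeBERT uses RoBERTa’s &lt;s&gt;/&lt;/s&gt; special tokens, which play the role of [CLS]/[SEP]):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

# Hypothetical example: in practice these come from the patch diff
important_tokens = "strcpy buf len"  # tokens changed by the patch
function_code = "void f(char *src) { char buf[8]; strcpy(buf, src); }"

# Encoding the two texts as a pair produces
# <s> important tokens </s></s> function </s>,
# i.e. the RoBERTa equivalent of [CLS] ... [SEP] ... [SEP]
encoding = tokenizer(
    important_tokens,
    function_code,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(tokenizer.decode(encoding["input_ids"][0]))
```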
My problem is that these important tokens would only show up in vulnerable functions, so I expect the model to overfit to their presence. Is there any way I can get around this?