My question might sound trivial, but I want to ensure I’m on the right track.
My task: I have a sentence containing some target words, each with corresponding start/end indices and a label (3 labels in total).
I am approaching the problem by customizing the classic run_token_classification.py script. During data preprocessing, I set the labels of all tokens that are not part of a target word to -100. During training, the data is processed through DataCollatorForTokenClassification and passed to BertForTokenClassification. Intuitively, this should work because the default CrossEntropyLoss ignores positions labeled -100, so the loss is computed only for the target tokens. Am I right?
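For concreteness, this is a minimal sketch of the label-masking step I mean (the sentence, character span, and label id are made up for illustration; the offset-based alignment assumes a fast tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Hypothetical example: one sentence with a single target span and label.
sentence = "The quick brown fox jumps over the lazy dog"
target_char_span = (16, 19)   # character start/end of "fox"
target_label = 1              # one of the 3 label ids

encoding = tokenizer(sentence, return_offsets_mapping=True, truncation=True)

labels = []
for start, end in encoding["offset_mapping"]:
    # Special tokens like [CLS]/[SEP] have (0, 0) offsets: mask them.
    if start == end:
        labels.append(-100)
    # Subword tokens inside the target span keep the real label.
    elif start >= target_char_span[0] and end <= target_char_span[1]:
        labels.append(target_label)
    # Everything else is masked out with -100 and ignored by the loss.
    else:
        labels.append(-100)

encoding["labels"] = labels
```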
I have also tried customizing the BERT model to extract an embedding for each target word (the sum/mean of its representations from the last four hidden layers) and use that for classification, with similar results.
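A rough sketch of what I mean by that variant (the target token positions are hypothetical, and summing layers then mean-pooling over subwords is just one possible pooling choice):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

sentence = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors,
# each of shape (batch, seq_len, hidden_size).
last_four = torch.stack(outputs.hidden_states[-4:])  # (4, 1, seq_len, hidden)

# Hypothetical: token position 4 covers the target word "fox"
# ([CLS]=0, The=1, quick=2, brown=3, fox=4).
target_token_positions = [4]

# Sum the last four layers, then mean-pool over the target's subword tokens.
summed = last_four.sum(dim=0)                                     # (1, seq_len, hidden)
target_embedding = summed[0, target_token_positions].mean(dim=0)  # (hidden,)
# target_embedding is then fed into a linear classification head.
```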
My main question is: Is my approach correct? Is modifying the script in this way enough?