I did get your point, but unfortunately PhoBERT does not have a fast version of its tokenizer.
Therefore, I manually wrote a function that does the thing mentioned:
One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in
Transformers by setting the labels we wish to ignore to
-100. In the example above, if the label for `@HuggingFace` is 3 (indexing `B-corporation`), we would set the labels of `['@', 'hugging', '##face']` to `[3, -100, -100]`.
# Assumes `examples` is a dict with word-level "labels" and "token" columns,
# and `tokenizer` is the (slow) PhoBERT tokenizer.
for i, label in tqdm(enumerate(examples["labels"]), total=len(examples["labels"])):
    steps = []   # subtoken indexes whose labels should be ignored
    batch = 0    # running offset caused by words split into several subtokens
    for index, value in enumerate(examples["token"][i]):
        len_to_compare = len(tokenizer.tokenize(value))
        if len_to_compare > 1:
            # all subtokens after the first one get ignored
            steps += list(range(index + batch + 1, index + batch + len_to_compare))
            batch += len_to_compare - 1
With the function above I simply store the array of indexes that should be ignored; however, my results got worse.
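For comparison, here is a minimal sketch of the same first-subtoken alignment done directly on the labels instead of collecting indexes afterwards. The function name `align_labels_with_subtokens` and the `tokenize` callback are my own illustrative choices, not part of Transformers; in practice you would pass `tokenizer.tokenize` of the slow PhoBERT tokenizer:

```python
def align_labels_with_subtokens(words, word_labels, tokenize):
    """Expand word-level labels to subtoken level: the first subtoken of each
    word keeps the word's label, every following subtoken gets -100 so the
    loss function ignores it (hypothetical helper, not a Transformers API)."""
    aligned = []
    for word, label in zip(words, word_labels):
        subtokens = tokenize(word)  # e.g. tokenizer.tokenize(word) for a slow tokenizer
        aligned.append(label)                       # first subtoken keeps the label
        aligned.extend([-100] * (len(subtokens) - 1))  # rest are ignored
    return aligned


# Toy tokenizer standing in for PhoBERT, just to show the shape of the output:
def toy_tokenize(word):
    return ["@", "hugging", "##face"] if word == "@HuggingFace" else [word]


print(align_labels_with_subtokens(["@HuggingFace", "rocks"], [3, 0], toy_tokenize))
# [3, -100, -100, 0]
```

Note this still ignores special tokens (`<s>`, `</s>`) that the tokenizer adds around the sequence; those also need a -100 label before the list is fed to the model.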
