Hello,
I am trying to find out whether there is any way to get wordpiece information after tokenization. I can think of many ways to do it manually, but I am curious whether there is a built-in parameter I can set so the tokenizer returns this directly. I already checked the parameters but none of them worked, so I am here to see if anyone knows of something like that.
Example:
for BERT:
let's say the tokens are >> ["[CLS]", "my", "token", "##ized", "words", "!", "[SEP]", "[PAD]"]
what I want >> [0, 0, 0, 1, 0, 0, 0, 0]
for RoBERTa:
let's say the tokens are >> ['<s>', 'my', 'Ġtoken', 'ized', 'Ġwords', '!', '</s>', '[PAD]']
what I want >> [0, 0, 0, 1, 0, 0, 0, 0]
If I implement a function to return these myself, it will be rule-based and model-dependent, since different tokenizers mark subword continuations differently.
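For illustration, here is a rough sketch of the kind of rule-based helper I mean. The function name `wordpiece_flags` and the `style` switch are made up for this example; it just hard-codes BERT's "##" prefix and RoBERTa's "Ġ" space marker, which is exactly the model-dependent part I would like to avoid:

```python
from transformers import AutoTokenizer

def wordpiece_flags(tokens, tokenizer, style):
    """Return 1 for pieces that continue a previous word, 0 otherwise.

    `style` ("bert" or "roberta") is a made-up switch standing in for the
    hard-coded, model-dependent rules I would rather not maintain.
    """
    special = set(tokenizer.all_special_tokens)
    flags = []
    for i, tok in enumerate(tokens):
        if tok in special:
            flags.append(0)
        elif style == "bert":
            # BERT marks continuation pieces with a "##" prefix
            flags.append(1 if tok.startswith("##") else 0)
        else:
            # RoBERTa marks word starts with a leading "Ġ" (space marker),
            # so a piece without it continues the previous word. Note this
            # naive rule also flags punctuation like "!", which is the kind
            # of edge case that makes the approach fragile.
            prev_is_word = i > 0 and tokens[i - 1] not in special
            flags.append(1 if prev_is_word and not tok.startswith("Ġ") else 0)
    return flags

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
pieces = tokenizer.tokenize("my tokenized words !")  # e.g. ['my', 'token', '##ized', 'words', '!']
print(wordpiece_flags(pieces, tokenizer, "bert"))    # e.g. [0, 0, 1, 0, 0]
```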
**WHY I NEED IT:**
I am trying to shape my dataset for token classification, so I need both wordpiece and special-token information. There is a parameter for the special-token mask, but nothing similar for wordpieces.
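For reference, this is the special-token parameter I mean (`return_special_tokens_mask=True`); I am looking for something analogous that marks wordpiece continuations instead:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("my tokenized words !", return_special_tokens_mask=True)

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["special_tokens_mask"])  # 1 for [CLS]/[SEP], 0 for regular tokens
```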