Index of wordpieces (subwords) after tokenization by transformers

Hello,

I am trying to find out whether there is any way to get wordpiece (subword) information after tokenization. I can think of several ways to do it manually, but I am curious whether there is a built-in parameter I can set so that my tokenizer returns this information. I have already checked the available parameters, but none of them worked, so I am asking here in case anyone knows of something like that.

Example:

For BERT:
Let's say the tokens are: ["[CLS]", "my", "token", "##ized", "words", "!", "[SEP]", "[PAD]"]
What I want: [0, 0, 0, 1, 0, 0, 0, 0]

For RoBERTa:
Let's say the tokens are: ['<s>', 'my', 'Ġtoken', 'ized', 'Ġwords', '!', '</s>', '<pad>']
What I want: [0, 0, 0, 1, 0, 0, 0, 0]

If I implement such a function myself, it will be rule-based and model-dependent, since different models tokenize in different ways.
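For illustration, here is a minimal sketch of the kind of rule-based function I mean, for the BERT case (assuming the standard "##" continuation prefix of WordPiece; the function name and arguments are my own, not anything from the library). For RoBERTa the rule would have to differ, since its BPE marks word *starts* with "Ġ" rather than continuations:

```python
def wordpiece_flags(tokens, special_tokens, continuation_marker="##"):
    """Mark continuation wordpieces: 1 if a token continues the previous
    word, 0 otherwise.

    Rule-based and model-dependent: this assumes a BERT-style WordPiece
    vocabulary, where continuation pieces carry a "##" prefix and special
    tokens are never continuations.
    """
    flags = []
    for tok in tokens:
        if tok in special_tokens:
            flags.append(0)          # special tokens are never wordpieces
        elif tok.startswith(continuation_marker):
            flags.append(1)          # "##..." continues the previous word
        else:
            flags.append(0)          # a new word starts here
    return flags


tokens = ["[CLS]", "my", "token", "##ized", "words", "!", "[SEP]", "[PAD]"]
print(wordpiece_flags(tokens, {"[CLS]", "[SEP]", "[PAD]"}))
# [0, 0, 0, 1, 0, 0, 0, 0]
```

This works for the BERT example above, but it is exactly the kind of per-model hard-coding I would like to avoid, which is why I am hoping for a built-in option.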

**Why I need it:**
I am trying to shape my dataset for token classification, hence I need both wordpiece and special-token information. There is a parameter for a special tokens mask, but nothing equivalent for wordpieces.
