Index of wordpieces (subwords) after tokenization by transformers

Hello,

I am trying to find out whether there is any way to get wordpiece (subword) information after tokenization. I can think of several ways to do it manually, but I am curious whether there is a built-in parameter I can set so that my tokenizer returns this information. I have already checked the available parameters, but none of them worked, so I am asking here in case anyone knows of something like that.

Example:

For BERT:
Let's say the tokens are: ["[CLS]", "my", "token", "##ized", "words", "!", "[SEP]", "[PAD]"]
What I want: [0, 0, 0, 1, 0, 0, 0, 0]

For RoBERTa:
Let's say the tokens are: ['<s>', 'my', 'Ġtoken', 'ized', 'Ġwords', '!', '</s>', '<pad>']
What I want: [0, 0, 0, 1, 0, 0, 0, 0]

If I implement such a function myself, it will be rule-based and model-dependent, since different models tokenize in different ways.
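For illustration, here is a minimal sketch of the kind of rule-based function I mean, for the BERT case (assuming the standard "##" continuation prefix of WordPiece; the function name and arguments are my own, not anything from the library). For RoBERTa the rule would have to differ, since its BPE marks word *starts* with "Ġ" rather than continuations:

```python
def wordpiece_flags(tokens, special_tokens, continuation_marker="##"):
    """Mark continuation wordpieces: 1 if a token continues the previous
    word, 0 otherwise.

    Rule-based and model-dependent: this assumes a BERT-style WordPiece
    vocabulary, where continuation pieces carry a "##" prefix and special
    tokens are never continuations.
    """
    flags = []
    for tok in tokens:
        if tok in special_tokens:
            flags.append(0)          # special tokens are never wordpieces
        elif tok.startswith(continuation_marker):
            flags.append(1)          # "##..." continues the previous word
        else:
            flags.append(0)          # a new word starts here
    return flags


tokens = ["[CLS]", "my", "token", "##ized", "words", "!", "[SEP]", "[PAD]"]
print(wordpiece_flags(tokens, {"[CLS]", "[SEP]", "[PAD]"}))
# [0, 0, 0, 1, 0, 0, 0, 0]
```

This works for the BERT example above, but it is exactly the kind of per-model hard-coding I would like to avoid, which is why I am hoping for a built-in option.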

**Why I need it:**
I am trying to shape my dataset for token classification, hence I need both wordpiece and special-token information. There is a parameter for a special tokens mask, but nothing equivalent for wordpieces.
