Is detokenize available in the transformers lib?

I’ve searched the docs but couldn’t find any hint.

Generally, detokenize is the inverse of the tokenize method and can be used to reconstruct a string from a set of tokens.

from transformers import TFBertTokenizer

tf_tokenizer = TFBertTokenizer.from_pretrained("bert-base-uncased")

# looking for something like
tf_tokenizer.encode([string]) # output: token ids
tf_tokenizer.decode([1, 2, 3]) # output: string


Available to some extent

cc @Rocketknight1

Hi @innat, and sorry for the delay! I don’t think our TF in-graph tokenizers support decoding/detokenization. However, our main tokenizers do. So you could do something like

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer.decode([1, 2, 3])

This should work for most purposes - do you have a use case for wanting to do detokenization inside a TF graph? We’re very interested if so, because we assumed people would generally not need that!
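For reference, a full encode/decode round trip with the plain tokenizer looks like this (a sketch; the exact token ids depend on the model’s vocabulary, so don’t hard-code them):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode a string to token ids (this adds the [CLS]/[SEP] special tokens)
ids = tokenizer.encode("hello world")

# Decode the ids back to a string; skip_special_tokens drops [CLS]/[SEP]
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)  # "hello world" (bert-base-uncased lowercases input)
```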
