Efficient detokenization method

The tokenizer has a __call__ function that accepts List[List[str]], so one can tokenize multiple samples at once efficiently. I am looking for a function that reverses this behavior. There are convert_ids_to_tokens() and convert_tokens_to_string(), but they only accept List[int] and List[str]. I do not want to iterate over the samples and convert them back one by one.

Is there an efficient method for this? I am using T5.

There is the decode method that could help.
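A minimal sketch of what that looks like for a single sample, assuming the t5-small checkpoint (any T5 checkpoint should behave the same way):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Encode a single sentence, then reverse it with decode()
ids = tokenizer("translate English to German: Hello world").input_ids
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)  # "translate English to German: Hello world"
```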

Are you sure that it can take multiple sentences? From the documentation, it seems to me that it only accepts one sample.

For multiple sentences there is its counterpart batch_decode.
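A quick sketch of round-tripping a whole batch, again assuming t5-small; skip_special_tokens drops the padding and end-of-sequence tokens so you get plain text back:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Encode a batch of sentences in one call ...
batch = tokenizer(["The first sample.", "A second, longer sample."], padding=True)

# ... and detokenize the whole batch in one call as well
texts = tokenizer.batch_decode(batch.input_ids, skip_special_tokens=True)
print(texts)  # ['The first sample.', 'A second, longer sample.']
```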
