Efficient detokenization method

The tokenizer has a __call__ function that accepts List[List[str]], so one can tokenize multiple samples at once efficiently. I am looking for a function that reverses this behavior. There are convert_ids_to_tokens() and convert_tokens_to_string(), but they only accept List[int] and List[str]. I do not want to iterate over the samples and convert them back one by one.

Is there an efficient method for this? I am using T5.

There is the decode method that could help.
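A minimal sketch of what that looks like for a single sample, assuming the t5-small checkpoint (any T5 checkpoint should behave the same way):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Encode a single sentence, then reverse it with decode()
ids = tokenizer("translate English to German: Hello world").input_ids
text = tokenizer.decode(ids, skip_special_tokens=True)
print(text)  # "translate English to German: Hello world"
```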

Are you sure that it can take multiple sentences? From the documentation, it seems to me that it only accepts one sample.

For multiple sentences there is its counterpart batch_decode.
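A quick sketch of round-tripping a whole batch, again assuming t5-small; skip_special_tokens drops the padding and end-of-sequence tokens so you get plain text back:

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Encode a batch of sentences in one call ...
batch = tokenizer(["The first sample.", "A second, longer sample."], padding=True)

# ... and detokenize the whole batch in one call as well
texts = tokenizer.batch_decode(batch.input_ids, skip_special_tokens=True)
print(texts)  # ['The first sample.', 'A second, longer sample.']
```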
