Text to Text Transformer - T5

dkamalakar · December 31, 2020, 1:53pm

Hello, I am trying to understand how the T5 SentencePiece tokenizer affects a custom dataset. I know T5 does not use lossless training (mT5 does), but I am unsure what impact this may have on any custom tokens in my dataset. Can someone please chime in if you have some insight?

Thanks

patrickvonplaten · January 3, 2021, 9:53pm

What do you mean by “lossless” training?

dkamalakar · January 4, 2021, 8:06pm

Sorry, I meant lossless tokenization. Please refer to section 3.1 in the link below.

From the paper:
We call this design lossless tokenization, in which all the information to reproduce the normalized text is preserved in the encoder’s output. The basic idea of lossless tokenization is to treat the input text just as a sequence of Unicode characters. Even whitespace is handled as a normal symbol. For the sake of clarity, SentencePiece first escapes the whitespace with a meta symbol ▁ (U+2581), and tokenizes the input into an arbitrary subword sequence, for example:
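
To make the quoted idea concrete, here is a rough sketch in plain Python of the whitespace escaping and the round trip it allows. This is only a toy and not the actual SentencePiece implementation: the fixed-size splitting stands in for a learned subword segmentation, and the helper names and the tiny vocabulary are made up for illustration. The second half shows how a piece that is missing from the vocabulary collapses to an unknown token, which is the kind of information loss I am worried about for custom tokens.

```python
from typing import List, Set

# Toy illustration only; none of this is the real SentencePiece code.
META = "\u2581"  # the meta symbol (U+2581) that stands in for a space


def escape_whitespace(text: str) -> str:
    """Replace each space with the meta symbol so it is an ordinary character."""
    return text.replace(" ", META)


def split_into_pieces(escaped: str, size: int = 3) -> List[str]:
    """Cut the text into fixed-size chunks (stand-in for a learned segmentation)."""
    return [escaped[i:i + size] for i in range(0, len(escaped), size)]


def detokenize(pieces: List[str]) -> str:
    """Concatenate the pieces and turn the meta symbol back into a space."""
    return "".join(pieces).replace(META, " ")


def map_to_vocab(pieces: List[str], vocab: Set[str], unk: str = "<unk>") -> List[str]:
    """Replace any piece that is not in the (hypothetical) vocabulary with unk."""
    return [p if p in vocab else unk for p in pieces]


text = "Hello World."
pieces = split_into_pieces(escape_whitespace(text))
restored = detokenize(pieces)  # identical to the input: nothing was lost

# Hypothetical vocabulary that happens to lack the piece "Wor".
toy_vocab = {"Hel", "lo" + META, "ld."}
lossy_pieces = map_to_vocab(pieces, toy_vocab)
lossy_text = detokenize(lossy_pieces)  # no longer equal to the input

print(pieces, restored)
print(lossy_pieces, lossy_text)
```

As long as every piece survives, joining the pieces and un-escaping the meta symbol gives back the input exactly. Once a piece is replaced by the unknown token, the original characters cannot be recovered from the token sequence.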
