Hello, I am trying to understand how T5's SentencePiece tokenizer affects a custom dataset. I know T5 does not use lossless training (mT5 does), but I am unsure what impact that may have on any custom tokens in my dataset. Can someone please chime in if you have some insight?
Sorry, I meant lossless tokenization. Please refer to Section 3.1 in the link below.
From the paper:
We call this design lossless tokenization, in which all the information to reproduce the normalized text is preserved in the encoder’s output. The basic idea of lossless tokenization is to treat the input text just as a sequence of Unicode characters. Even whitespace is handled as a normal symbol. For the sake of clarity, SentencePiece first escapes the whitespace with a meta symbol ▁ (U+2581), and tokenizes the input into an arbitrary subword sequence, for example:
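
To make the quoted idea concrete, here is a minimal sketch in plain Python. It is not the SentencePiece library and not the paper's own example: `toy_tokenize` and `detokenize` are hypothetical helpers, and the fixed-length split is only a stand-in for a learned subword model. It shows why escaping whitespace with ▁ makes the round trip reversible.

```python
from typing import List

WS = "\u2581"  # the meta symbol ▁ (U+2581) used to escape whitespace


def toy_tokenize(text: str, piece_len: int = 3) -> List[str]:
    """Escape spaces with ▁, then cut the string into arbitrary pieces.

    Assumption: fixed-length chunks stand in for a real subword model.
    """
    escaped = text.replace(" ", WS)
    return [escaped[i:i + piece_len] for i in range(0, len(escaped), piece_len)]


def detokenize(pieces: List[str]) -> str:
    """Join the pieces and turn ▁ back into a space.

    Caveat: this toy version assumes the input never contains ▁ itself.
    """
    return "".join(pieces).replace(WS, " ")


text = "Hello World."
pieces = toy_tokenize(text)
print(pieces)
print(detokenize(pieces) == text)
```

Because whitespace is kept as an ordinary symbol inside the pieces, nothing is lost no matter where the pieces are cut. With a real T5 vocabulary the situation may differ for custom tokens: characters that the vocabulary does not cover can end up as the unknown token, so it is worth encoding and decoding a few of your own strings to check.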