For a project I need to recover, from a piece of text, the token encoding that produced it. Crucially, the recovered encoding must be identical to the one the NN originally output.
From my understanding, the encoding should be unique as long as the same tokenizer settings are used when producing the text.
When I do this, however, I get different results:
from transformers import GPT2Tokenizer
tok = GPT2Tokenizer.from_pretrained('gpt2-medium')
indexVals = [326, 220, 39608]
print(tok.encode(tok.decode(indexVals)))  # re-encode the decoded text
The decoded text is " that gey" in both cases; the encodings, however, differ. Once I get:
[326, 220, 39608]
[326, 308, 2959]
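To make the mismatch concrete, here is a toy sketch of how two different ID sequences can decode to the same string. The ID-to-string table below is an assumption inferred from the two outputs above, not the actual GPT-2 vocabulary:

```python
# Toy decode table inferred from the outputs above -- an assumption,
# not the real GPT-2 BPE vocabulary.
vocab = {326: " that", 220: " ", 39608: "gey", 308: " g", 2959: "ey"}

def decode(ids):
    """Concatenate the string piece for each token ID."""
    return "".join(vocab[i] for i in ids)

print(decode([326, 220, 39608]))  # " that gey"
print(decode([326, 308, 2959]))   # " that gey" -- same text, different IDs
```

Since decoding only concatenates string pieces, any two segmentations of the same string map back to identical text, so decoding is not injective.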
Do you have any idea why this happens and what I can do about it?