Encoding Reproducible Results

PhilippIce · November 26, 2020, 6:26pm

Hey everybody,
For a project, I need to recover the encoding (the token IDs) that produces a specific text, starting from the text itself. It is important that the recovered encoding is identical to the encoding that the NN originally output.

From my understanding, the encoding should be unique, provided the same settings are used as when the text was created.
When I try this, however, I get different results:
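from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained('gpt2-medium')
indexVals = [326, 220, 39608]
# Decode the original token IDs to text
text = tok.decode(indexVals)
print(text)
# Re-encode the decoded text and decode it again
indexValsBack = tok.encode(text)
print(tok.decode(indexValsBack))
# Compare the original IDs with the re-encoded IDs
print(indexVals)
print(indexValsBack)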

The output text is " that gey" both times; however, the encodings are once:
[326, 220, 39608]
and once:
[326, 308, 2959]

Do you have any idea why this happens and what to do about it?

Thanks!
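
One possible explanation, offered as a sketch rather than a confirmed diagnosis: decoding just concatenates the string pieces of the tokens, so several different ID sequences can decode to the same text, while encoding is deterministic and always returns one particular segmentation. The toy example below illustrates this with a made-up vocabulary and a greedy longest-match encoder. The names `TOY_VOCAB`, `decode`, `encode`, and `is_canonical` are hypothetical, and neither the vocabulary nor the matching rule is GPT-2's real vocabulary or BPE merge procedure.

```python
# Toy vocabulary (hypothetical; NOT GPT-2's real vocabulary or merge rules).
TOY_VOCAB = {0: " that", 1: " ", 2: "gey", 3: " g", 4: "ey"}


def decode(ids):
    """Decoding concatenates token strings, so it is many-to-one."""
    return "".join(TOY_VOCAB[i] for i in ids)


def encode(text):
    """Greedy longest-match encoder, a simple stand-in for a deterministic tokenizer."""
    # Try longer pieces first; ties keep their vocabulary order (sorted is stable).
    pieces = sorted(TOY_VOCAB.items(), key=lambda kv: -len(kv[1]))
    ids, pos = [], 0
    while pos < len(text):
        for idx, piece in pieces:
            if text.startswith(piece, pos):
                ids.append(idx)
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot encode text at position {pos}")
    return ids


def is_canonical(ids):
    """True if re-encoding the decoded text gives back the same IDs."""
    return encode(decode(ids)) == ids


sampled = [0, 1, 2]   # " that" + " " + "gey", analogous to a sampled sequence
text = decode(sampled)
print(repr(text))
print(sampled, encode(text), is_canonical(sampled))
```

If this is indeed what happens with the real tokenizer, then the text alone does not determine the IDs the model produced, and the safest route is probably to store the generated IDs alongside the text instead of re-encoding it. A round-trip check like `is_canonical` can at least flag sequences that would not survive re-encoding.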
