I wonder why, in some cases, encoding a text and then decoding it does not give back the original text.
For example, with this very simple code:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

texts = [' A LoyaltyImport is a ...', ' You can ...', ' .withVariants("true")']

for row in texts:
    encoded = tokenizer(
        row,
        add_special_tokens=False,
        truncation=True,
        padding=False,
        max_length=2000,
        return_overflowing_tokens=False,
        return_length=False,
    )
    decoded = tokenizer.decode(encoded["input_ids"])
    if decoded != row:
        print("Different")
        print(row)
        print(decoded)
Then I see this output:
Different
A LoyaltyImport is a ...
A LoyaltyImport is a...
Different
You can ...
You can...
Different
.withVariants("true")
.withVariants("true")
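To see where the spaces go, I also printed the raw tokens for each string (a small sketch on top of the code above; convert_ids_to_tokens is the only extra call):

    for row in texts:
        ids = tokenizer(row, add_special_tokens=False)["input_ids"]
        # Show the tokenizer's own view of the string, including how
        # leading/internal spaces are attached to tokens
        print(tokenizer.convert_ids_to_tokens(ids))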
It seems to happen especially when dots and spaces are involved.
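One thing I noticed: decode accepts a clean_up_tokenization_spaces argument, and passing False seems to make the round trip exact for these three strings, at least with my transformers version (a sketch based on the loop above, not a verified fix):

    for row in texts:
        encoded = tokenizer(row, add_special_tokens=False)
        # Skip the post-decoding cleanup that normalizes spaces around punctuation
        decoded = tokenizer.decode(encoded["input_ids"],
                                   clean_up_tokenization_spaces=False)
        print(decoded == row)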
Why does this happen in the first place, and is disabling the cleanup the right way to fix it?
Thanks!