Encoding and then decoding text does not round-trip

I wonder why, in some cases, encoding text and then decoding it does not give back the original text.
For example, here is a very simple script:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

texts = ['  A LoyaltyImport is a ...', '  You can ...', '    .withVariants("true")']
for row in texts:
    encoded = tokenizer(row,
                        add_special_tokens=False,
                        truncation=True,
                        padding=False,
                        max_length=2000,
                        return_overflowing_tokens=False,
                        return_length=False)
    decoded = tokenizer.decode(encoded["input_ids"])
    if decoded != row:
        print("Different")
        print(row)
        print(decoded)

Then I see this output:

Different
  A LoyaltyImport is a ...
  A LoyaltyImport is a...
Different
  You can ...
  You can...
Different
    .withVariants("true")
   .withVariants("true")

This happens especially around dots and spaces.
Why does it happen, and how can I fix it?
Thanks!


Just add this line after loading the tokenizer:
tokenizer.clean_up_tokenization_spaces = False

Llama 3 sets it to true by default, as you can see in its tokenizer_config.json file.
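For context, `clean_up_tokenization_spaces` triggers a simple post-processing pass on the decoded string that removes the space detokenization can leave before punctuation and contractions. A rough pure-Python sketch of that pass (modeled on `clean_up_tokenization` in transformers; the exact rule set may vary by version), applied to the examples from the question:

```python
def clean_up_tokenization(out_string: str) -> str:
    """Sketch of the cleanup applied when clean_up_tokenization_spaces
    is True: drop the space before common punctuation and contractions."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "' ")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )

# The question's strings lose a character exactly where " ." occurs:
print(clean_up_tokenization("  A LoyaltyImport is a ..."))  # "  A LoyaltyImport is a..."
print(clean_up_tokenization('    .withVariants("true")'))   # '   .withVariants("true")'
```

Alternatively, instead of setting the attribute on the tokenizer, you can disable it per call: `tokenizer.decode(encoded["input_ids"], clean_up_tokenization_spaces=False)`.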

