Encoding and then decodeing text is not equal

ron5569 · August 11, 2024, 9:08am

I wonder why in some cases, encdoing the text and then decoding it, is not the same the original text
For example very simple code

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

texts = ['  A LoyaltyImport is a ...', '  You can ...', '    .withVariants("true")']
for row in texts:
    encoded = tokenizer(row,
                        add_special_tokens=False,
                        truncation=True,
                        padding=False,
                        max_length=2000,
                        return_overflowing_tokens=False,

                        return_length=False, )
    decoded = tokenizer.decode(encoded["input_ids"])
    if decoded != row:
        print("Different")
        print(row)
        print(decoded)

Then I see the output

Different
  A LoyaltyImport is a ...
  A LoyaltyImport is a...
Different
  You can ...
  You can...
Different
    .withVariants("true")
   .withVariants("true")

This is happen especially when using of dots and spaces.
Why does it happen and how can I fix that
Thanks!

mapama247 · August 12, 2024, 7:59am

Just add this line after loading the tokenizer:
tokenizer.clean_up_tokenization_spaces = False

Llama3 has it to true by default, as you can see on its tokenizer_config.json file.

system · August 13, 2024, 2:01pm

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Tokenizer mapping the same token to multiple token_ids 🤗Tokenizers	4	656	April 22, 2024
Llama2 tokenizer duplicate ids Beginners	2	1428	April 21, 2024
How to decode with spaces? 🤗Tokenizers	0	1859	April 28, 2022
How does `tokenizer().input_ids` work and how different it is from tokenizer.encode() before `model.generate()` and decoding step? 🤗Tokenizers	1	2861	February 22, 2023
Difference between tokenizer and convert_tokens_to_ids 🤗Tokenizers	0	298	May 12, 2024

Encoding and then decodeing text is not equal

Related topics