Sequence truncation not working

Hi all, my tokenizer configuration is as follows:

from tokenizers import Tokenizer, decoders, pre_tokenizers, processors
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(
    unk_token='[UNK]'
))
tokenizer.enable_padding(
    direction='right',
    pad_id=0,
    pad_token='[PAD]'
)
tokenizer.enable_truncation(
    max_length=512,
    direction='right'
)
# add_special_tokens expects a list of tokens, so take the dict values
tokenizer.add_special_tokens(
    list({'bos_token': '[BOS]',
          'eos_token': '[EOS]',
          'unk_token': '[UNK]',
          'sep_token': '[SEP]',
          'pad_token': '[PAD]',
          'cls_token': '[CLS]',
          'mask_token': '[MASK]'}.values())
)
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(
    add_prefix_space=False,
    use_regex=False
)
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
tokenizer.decoder = decoders.ByteLevel()

trainer = BpeTrainer(
    special_tokens=['[PAD]', '[UNK]', '[EOS]', '[BOS]', '[MASK]'],
    show_progress=True,
    initial_alphabet=['[BOS]']
)

tokenizer.train(files=[datasets_path], trainer=trainer)

When I test the trained tokenizer directly, it works fine:

encoded_smi = tokenizer.encode('[BOS]c1ccccc1')
print(f"Encoded Tokens: {encoded_smi.ids}")
>> Encoded Tokens: [3, 49, 12, 74, 12]

decoded_smi = tokenizer.decode(encoded_smi.ids)
print(f"Decoded SMILES: {decoded_smi}")
>> Decoded SMILES: c1ccccc1
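
Before loading it with Transformers, the tokenizer has to be saved to disk. That step is not shown in my snippets; roughly, it wraps the trained tokenizers object in a PreTrainedTokenizerFast and saves it into the directory used below. Treat this sketch as an assumption about that step, not my exact script:

from transformers import PreTrainedTokenizerFast

# Sketch (assumption): wrap the trained `tokenizers` object so that
# save_pretrained() writes tokenizer.json plus the Transformers config files
# into the directory loaded with AutoTokenizer below.
wrapped = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token='[BOS]',
    eos_token='[EOS]',
    unk_token='[UNK]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    mask_token='[MASK]'
)
wrapped.save_pretrained('./path/to/tokenization/')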

However, when I load it through the Transformers tokenizer API, everything changes:

from transformers import AutoTokenizer
new_tokenizer = AutoTokenizer.from_pretrained('./path/to/tokenization/')
new_tokenizer
>> PreTrainedTokenizerFast(name_or_path='./path/to/tokenization/', vocab_size=11449, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[BOS]', 'eos_token': '[EOS]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

I found that model_max_len does not match my setting; it should be 512. Then I ran a test on new_tokenizer:

test_smi = 'C'*1024*128*2*2
tokens = new_tokenizer.encode(test_smi)
display(len(tokens), tokens[1], tokens[-1])
>> 
2048
6014
6014
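
As a side check on the model_max_len mismatch: passing model_max_length explicitly at load time does change what the tokenizer reports (a sketch; I am assuming from_pretrained forwards this kwarg to the tokenizer constructor), though it does not by itself make a plain encode() call truncate:

from transformers import AutoTokenizer

# Sketch (assumption): model_max_length passed to from_pretrained is forwarded
# to the tokenizer's __init__ and overrides the huge default.
capped_tokenizer = AutoTokenizer.from_pretrained(
    './path/to/tokenization/',
    model_max_length=512
)
print(capped_tokenizer.model_max_length)  # should print 512 instead of the huge default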

From the test above, it is clear the tokenizer did not truncate the sequence. However, if I change the call to:

len(new_tokenizer(test_smi, truncation=True, max_length=512)['input_ids'])
>> 512

Then everything works fine! I wonder why the AutoTokenizer.from_pretrained() API does not keep the maximum length setting, and whether it would affect the Transformers training phase…?
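
For now, my plan is to persist the limit at save time by setting model_max_length on the PreTrainedTokenizerFast wrapper from the save step above (again a sketch, under the same assumptions):

from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Sketch (assumption): model_max_length set on the wrapper is written into
# tokenizer_config.json by save_pretrained and picked up again on reload.
wrapped = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    model_max_length=512,
    bos_token='[BOS]',
    eos_token='[EOS]',
    unk_token='[UNK]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    mask_token='[MASK]'
)
wrapped.save_pretrained('./path/to/tokenization/')

reloaded = AutoTokenizer.from_pretrained('./path/to/tokenization/')
print(reloaded.model_max_length)  # should now print 512 after reloading

For training I would then still pass truncation=True when tokenizing the dataset, since model_max_length alone only seems to set the default max_length rather than turn truncation on.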