Hi all, my tokenizer configuration is as follows:
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.processors import ByteLevel
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token='[UNK]'))
tokenizer.enable_padding(
    direction='right',
    pad_id=0,
    pad_token='[PAD]'
)
tokenizer.enable_truncation(
    max_length=512,
    direction='right'
)
# add_special_tokens expects a list of token strings, not a transformers-style dict
tokenizer.add_special_tokens(
    ['[BOS]', '[EOS]', '[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
)
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(
    add_prefix_space=False,
    use_regex=False
)
tokenizer.post_processor = ByteLevel(trim_offsets=True)
tokenizer.decoder = decoders.ByteLevel()
trainer = BpeTrainer(
    special_tokens=['[PAD]', '[UNK]', '[EOS]', '[BOS]', '[MASK]'],
    show_progress=True,
    initial_alphabet=['[BOS]']
)
tokenizer.train(files=[datasets_path], trainer=trainer)
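For context, here is a minimal sketch of how such a tokenizer can be saved so that AutoTokenizer can find it later (the directory is the same placeholder path used below, and wrapping it in PreTrainedTokenizerFast with explicit special-token keywords is an assumption on my part, not necessarily my exact saving code):
from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizers object and save it where AutoTokenizer will look.
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token='[BOS]',
    eos_token='[EOS]',
    unk_token='[UNK]',
    sep_token='[SEP]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    mask_token='[MASK]'
)
wrapped_tokenizer.save_pretrained('./path/to/tokenization/')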
When I test the trained tokenizer directly, it works fine:
encoded_smi = tokenizer.encode('[BOS]c1ccccc1')
print(f"Encoded Tokens: {encoded_smi.ids}")
>> Encoded Tokens: [3, 49, 12, 74, 12]
decoded_smi = tokenizer.decode(encoded_smi.ids)
print(f"Decoded SMILES: {decoded_smi}")
>> Decoded SMILES: c1ccccc1
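The truncation settings also appear to be serialized; here is a quick inspection sketch of the saved tokenizer.json (same placeholder path, assuming the standard layout with top-level 'truncation' and 'padding' keys):
import json

# Peek at the serialized tokenizer to confirm the truncation block is present.
with open('./path/to/tokenization/tokenizer.json') as f:
    config = json.load(f)
print(config['truncation'])  # should report max_length: 512 if the setting was saved
print(config['padding'])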
However, when I load it through the Transformers tokenizer API, everything changes:
from transformers import AutoTokenizer
new_tokenizer = AutoTokenizer.from_pretrained('./path/to/tokenization/')
new_tokenizer
>> PreTrainedTokenizerFast(name_or_path='./path/to/tokenization/', vocab_size=11449, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[BOS]', 'eos_token': '[EOS]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
I noticed that model_max_len differs from my setting; it should be 512 instead. So I ran a test on new_tokenizer:
test_smi = 'C'*1024*128*2*2
tokens = new_tokenizer.encode(test_smi)
display(len(tokens), tokens[1], tokens[-1])
>>
2048
6014
6014
It’s obvious that the tokenizer didn’t truncate the sequence. However, if I change the code to:
len(new_tokenizer(test_smi, truncation=True, max_length=512)['input_ids'])
>> 512
Then everything works fine! I wonder why the AutoTokenizer.from_pretrained() API doesn’t keep the maximum-length setting, and whether it would affect the Transformers training phase…?
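In the meantime, passing model_max_length explicitly at load time looks like a possible workaround (a minimal sketch, same placeholder path as above):
# Set model_max_length explicitly when loading, or patch it afterwards.
new_tokenizer = AutoTokenizer.from_pretrained(
    './path/to/tokenization/',
    model_max_length=512
)
# alternatively: new_tokenizer.model_max_length = 512
len(new_tokenizer(test_smi, truncation=True)['input_ids'])  # should now cap at 512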