I am currently trying to train an MLM with a byte-level BPE tokenizer (the tokenizers library) on a custom corpus, and I am getting the following error:
AttributeError: 'tokenizers.Tokenizer' object has no attribute 'mask_token'
Shown below is the code:
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.processors import RobertaProcessing
from transformers import DataCollatorForLanguageModeling

BOS = "<s>"
EOS = "</s>"
UNK = "<unk>"
PAD = "<pad>"
MASK = "<mask>"
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.enable_truncation(max_length=512)
trainer = BpeTrainer(
    vocab_size=50000,
    special_tokens=[BOS, PAD, EOS, UNK, MASK],
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer)

# Padding is configured after training so the pad id matches the trained vocab
tokenizer.enable_padding(pad_id=tokenizer.token_to_id(PAD), pad_token=PAD)
tokenizer.post_processor = RobertaProcessing(
    sep=(EOS, tokenizer.token_to_id(EOS)),
    cls=(BOS, tokenizer.token_to_id(BOS)),
)
data_collator = DataCollatorForLanguageModeling(
    tokenizer,
    mlm_probability=0.15,
    return_tensors="tf",
)
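For context, batch_iterator just streams batches of raw text from my corpus; a simplified, hypothetical version (corpus_lines stands in for my real data source — train_from_iterator only needs an iterator over batches of strings):

def batch_iterator(batch_size=1000):
    # Hypothetical stand-in for my real corpus reader
    for i in range(0, len(corpus_lines), batch_size):
        yield corpus_lines[i : i + batch_size]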
Any ideas? My current environment makes it difficult to save the tokenizer to disk and load it back with from_pretrained, so I would prefer something that works in memory.
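One idea I had was to wrap the raw tokenizers.Tokenizer in a transformers fast tokenizer so the collator can find mask_token without a save/load round trip. A minimal sketch, assuming PreTrainedTokenizerFast's tokenizer_object argument and that the special-token strings match the ones trained above:

from transformers import PreTrainedTokenizerFast

# Wrap the raw tokenizer so the collator sees mask_token, pad_token, etc.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token=BOS,
    eos_token=EOS,
    unk_token=UNK,
    pad_token=PAD,
    mask_token=MASK,
)

data_collator = DataCollatorForLanguageModeling(
    hf_tokenizer,
    mlm_probability=0.15,
    return_tensors="tf",
)

Does that look like the right approach, or is there a cleaner way to hand a raw tokenizer to DataCollatorForLanguageModeling?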
Thanks