Hi,
I've spent the whole day trying to figure out how to train my own BPE tokenizer and push it to the Hub so I can train my own GPT-2 model. My code looks like this:
from transformers import AutoTokenizer, GPT2TokenizerFast
from datasets import load_dataset
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)
from tqdm import tqdm
raw_dataset = load_dataset('<dataset>')

def get_training_corpus():
    # Yield the dataset's 'content' column in batches of 1000 examples
    dataset = raw_dataset['train']
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples['content']
# Train the tokenizer; it will be wrapped in a GPT2TokenizerFast afterwards
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)  # GPT-2-style byte-level pre-tokenization
trainer = trainers.BpeTrainer(vocab_size=75_000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
tokenizer.decoder = decoders.ByteLevel()
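To sanity-check the trained tokenizer before saving, I added a quick test (the sample sentence is just an arbitrary example I made up):

# Quick check: vocabulary size and a round-trip encode/decode
print(tokenizer.get_vocab_size())      # should be close to 75_000
encoding = tokenizer.encode("Hallo, wie geht es dir?")
print(encoding.tokens)                 # byte-level BPE tokens
print(tokenizer.decode(encoding.ids))  # should reproduce the input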
I tried to save the vocab and merges files using:
tokenizer.save_model("gpt2-xl-de")
But it throws an error: AttributeError: 'tokenizers.Tokenizer' object has no attribute 'save_model'
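From the tokenizers API reference it looks like save lives on the model attribute rather than on the Tokenizer itself, so maybe this is the intended way to get the vocab.json and merges.txt files (a sketch of what I mean, not verified):

import os
os.makedirs("gpt2-xl-de", exist_ok=True)
# Model.save should write vocab.json and merges.txt into the folder
files = tokenizer.model.save("gpt2-xl-de")
print(files)  # paths of the files that were written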
I can at least save the whole tokenizer to a single JSON file using:
tokenizer.save("gpt2-xl-de/tokenizer_BPE.json")
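For what it's worth, that JSON file can presumably be loaded back like this (a round-trip sketch, assuming the file was written):

from tokenizers import Tokenizer
# Reload the tokenizer from the single JSON file written above
reloaded = Tokenizer.from_file("gpt2-xl-de/tokenizer_BPE.json")
print(reloaded.get_vocab_size())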
I also tried to wrap it in a GPT2TokenizerFast:
wrapped_tokenizer = GPT2TokenizerFast(
    tokenizer_object=tokenizer,
)
But this also throws an error: TypeError: __init__() missing 2 required positional arguments: 'vocab_file' and 'merges_file'
I also don't know how to push this to the Hub. I would expect to do it with:
wrapped_tokenizer.save_pretrained("<name>", push_to_hub=True, repo_name='<name>')
Is this the right way?
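Or should I use the dedicated push_to_hub method instead (a sketch, assuming the wrapping eventually works; <name> is still a placeholder for my repo):

# Requires being logged in, e.g. via `huggingface-cli login`
wrapped_tokenizer.save_pretrained("gpt2-xl-de")
wrapped_tokenizer.push_to_hub("<name>")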
Since I couldn't get it running, and I don't understand why the code from the documentation isn't working for me, I hope somebody here can point me in the right direction.
I also tried this:
from transformers import AutoTokenizer
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(),
    vocab_size=75_000,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
But here I also got an error:
AttributeError: 'GPT2TokenizerFast' object has no attribute 'train_new_from_iterator'
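Could this be a version problem? I checked my installed versions to rule that out (train_new_from_iterator only exists in newer transformers releases, as far as I can tell):

import tokenizers
import transformers
# An outdated install could explain the missing method (my assumption)
print(transformers.__version__)
print(tokenizers.__version__)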
Please help me: I tried to use the code from the documentation, but it isn't working. Why is this?
Thanks