Tokenizer Saving, Wrapper, and Push to Hub Issues

Hi,

I have tried the whole day to figure out how to train my own BPE tokenizer and push it to the Hub so I can train my own GPT-2 model. My code looks like this:

from datasets import load_dataset
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)
from transformers import GPT2TokenizerFast

raw_dataset = load_dataset('<dataset>')

def get_training_corpus():
    # Yield the training texts in batches of 1000 examples
    dataset = raw_dataset['train']
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples['content']

# Train the tokenizer; it will be wrapped in a GPT2TokenizerFast later
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(vocab_size=75_000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
tokenizer.decoder = decoders.ByteLevel()
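
At least a quick sanity check on the trained tokenizer looks fine to me (the sample sentence is just something I made up):

encoding = tokenizer.encode("Das ist ein Test.")
print(encoding.tokens)                 # byte-level BPE tokens
print(tokenizer.decode(encoding.ids))  # should round-trip back to the input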

I tried to save the vocab and merge files using:

tokenizer.save_model("gpt2-xl-de")

But it throws an error: AttributeError: 'tokenizers.Tokenizer' object has no attribute 'save_model'

I can save the whole tokenizer as a single JSON file using:

tokenizer.save("gpt2-xl-de/tokenizer_BPE.json")

I also tried to wrap it into a GPT2TokenizerFast:

wrapped_tokenizer = GPT2TokenizerFast(
    tokenizer_object=tokenizer
)

But this also throws an error: TypeError: __init__() missing 2 required positional arguments: 'vocab_file' and 'merges_file'
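
I wondered whether the generic wrapper would accept the object instead; this is just a guess on my part and assumes a recent transformers version:

from transformers import PreTrainedTokenizerFast

wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    eos_token="<|endoftext|>",  # the special token I trained with
)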

I also don't know how to push this to the Hub.
I would expect to do it with:

wrapped_tokenizer.save_pretrained("<name>", push_to_hub=True, repo_name='<name>')

Is this the right way?
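
Or maybe with the dedicated method, assuming my transformers version already has it:

# push_to_hub may not exist in older transformers versions
wrapped_tokenizer.push_to_hub("<name>")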

Since I can't get it running and I don't understand why the code from the documentation is not working for me, I hope somebody here can point me in the right direction.

I also tried this:

from transformers import AutoTokenizer
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
new_tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(),
    vocab_size=75_000,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

But here I got an error as well:
AttributeError: 'GPT2TokenizerFast' object has no attribute 'train_new_from_iterator'
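
I suspect my transformers version might simply be too old for this method, but I am not sure:

import transformers
print(transformers.__version__)
print(old_tokenizer.is_fast)  # train_new_from_iterator should exist on fast tokenizers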

Please help me. I tried to use the code from the documentation, but it is not working.
Why is that?

Thanks

Perhaps it depends on the different types of classes I use.

For building up my tokenizer I used the tokenizers pipeline library. But this library has only the save method, which saves a tokenizer.json file.

How can I convert my pipeline tokenizer to the transformers GPT2TokenizerFast class? I tried to use the JSON file, but I got the error that the vocab and merges files are missing.

How are these two libraries connected?

Please help. Thanks.

I'm not sure, but is the following the right way to transfer from the pipeline Tokenizer class to the transformers GPT2TokenizerFast class?

# After training the tokenizer
tokenizer.save("gpt2-xl-de/tokenizer_BPE.json")
tokenizer.model.save("gpt2-xl-de/")
# The line above saves the vocab.json and merges.txt files
# Create a GPT2TokenizerFast from the saved files
from transformers import GPT2TokenizerFast
wrapped_tokenizer = GPT2TokenizerFast(
    tokenizer_file="gpt2-xl-de/tokenizer_BPE.json",
    vocab_file="gpt2-xl-de/vocab.json",
    merges_file="gpt2-xl-de/merges.txt",
    add_prefix_space=False,
)
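
A quick round trip with the wrapped tokenizer also looks correct:

ids = wrapped_tokenizer("Das ist ein Test.")["input_ids"]
print(wrapped_tokenizer.decode(ids))  # should reproduce the input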

It works for the tokenizer. But I still cannot push it to the Hub, because my GPT2TokenizerFast seems to have no push_to_hub method.
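
As a workaround I thought about saving everything locally and uploading the folder with huggingface_hub directly; this is untested on my side and assumes a huggingface_hub version that has create_repo and upload_folder:

from huggingface_hub import HfApi

wrapped_tokenizer.save_pretrained("gpt2-xl-de")

api = HfApi()
api.create_repo("<name>", exist_ok=True)  # "<name>" is my repo id placeholder
api.upload_folder(folder_path="gpt2-xl-de", repo_id="<name>")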

Please help me get a better understanding of how these two worlds (the tokenizers library and the transformers library) are connected to each other, and how to push to the Hub.

How did you solve it?