GPT2 Training from scratch in German


Please go through the questions below; any help would be appreciated.

I am trying to train a GPT-2 model from scratch.
I am not sure if I am doing it right, and I have a few questions. Here is my current implementation.

Tokenizer:
Question 1:
Am I training the tokenizer the right way? Should I use all of my training text files to train the tokenizer?

from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

# Gather every .txt file under the current directory as tokenizer training data
paths = [str(x) for x in Path(".").glob("**/*.txt")]

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Create the output directory and save vocab.json / merges.txt there
Path("German").mkdir(exist_ok=True)
tokenizer.save_model("German")

The above code gives me: ['German/vocab.json', 'German/merges.txt']

Initializing the GPT-2 tokenizer
Question 2: Is this the right way to do it?

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained(
    "./German",
    additional_special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    pad_token="<pad>",
    max_len=512,
)
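
A quick sanity check I would add here (my own sketch, not part of the original post): round-trip a German sentence through the loaded tokenizer and confirm the pad token was picked up.

# Hypothetical check of the loaded tokenizer
ids = tokenizer.encode("Das ist ein kurzer Test.")
print(ids)
print(tokenizer.decode(ids))
print(tokenizer.pad_token, tokenizer.pad_token_id)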

Initializing the GPT-2 model:
Question 3: Am I initializing the GPT-2 language model properly?

from transformers import GPT2Config, GPT2LMHeadModel

# Initializing a GPT-2 configuration with the vocabulary size used for the tokenizer
configuration = GPT2Config(vocab_size=52_000)
model = GPT2LMHeadModel(config=configuration)
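
One thing I would check at this point (a sketch of my own, assuming the tokenizer from Question 2): the five special tokens can push the tokenizer's vocabulary past 52,000, so it is worth resizing the embeddings to match before training.

# Hypothetical sanity check: keep the embedding matrix in sync with the tokenizer
print("tokenizer vocab:", len(tokenizer))
model.resize_token_embeddings(len(tokenizer))
print("parameters:", model.num_parameters())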

Dataset:

My dataset is just an example text file containing 1 million German sentences:

Question 4: Now, what kind of dataset do you suggest I use to train this model?

Here is my logic for dataset loading and training.

from transformers import TextDataset

# Tokenize the corpus and split it into blocks of 128 tokens
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="./deu-de_web-public_2019_1M-sentences.txt",
    block_size=128,
)

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False,
)
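
To see what actually gets fed to the model, here is a minimal inspection of one collated batch (my own addition, assuming the dataset and collator defined above). With mlm=False the collator copies input_ids into labels, and GPT2LMHeadModel shifts them internally to compute the next-token loss.

# Hypothetical inspection of one collated batch
batch = data_collator([dataset[i] for i in range(2)])
print(batch["input_ids"].shape)
print(batch["labels"].shape)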

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
    prediction_loss_only=True,
)

trainer.train()
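
Once training finishes, a natural follow-up (again a sketch of my own, not from the original post) is to save the model next to the tokenizer and sample a short continuation to see whether it has picked up any German:

# Hypothetical follow-up: persist and sample from the trained model
trainer.save_model("./output")
tokenizer.save_pretrained("./output")

input_ids = tokenizer.encode("Heute ist ein", return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_length=30, do_sample=True, top_k=50)
print(tokenizer.decode(output[0]))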

Question 5: Is this the right way to train the model? Is there a specific format the model expects the training data to be in?

Question 6: How can we use multiple GPUs?
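
A hedged note of my own (not from the original thread): as far as I know, when the script is launched normally, Trainer wraps the model in torch.nn.DataParallel and uses every GPU visible to the process, so the effective batch size is the per-GPU batch size times the number of GPUs. A quick way to see what it will use:

import torch

# How many GPUs the Trainer will spread batches across
print("visible GPUs:", torch.cuda.device_count())
print("trainer n_gpu:", training_args.n_gpu)

For distributed multi-GPU runs, the same script can also be launched with torch.distributed (e.g. python -m torch.distributed.launch --nproc_per_node=4 your_script.py, where your_script.py is a placeholder name for a file containing the code above).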


Hey @vikasRajashekar, did this method work? I am trying this on Hindi using the same approach.

@patrickvonplaten It would be helpful if you could give some input. If I succeed with that, I will write a clear blog post explaining the steps so that others can benefit as well.


Hey @vikasRajashekar
It worked for me.
