How to train GPT-2 from scratch? (no fine-tuning)

Hi, I would like to train GPT-2 from scratch. I don’t want to fine-tune an existing model, but actually train it from scratch with my own tokenizer. How could I do it?


Hi @iamnotapenguin, the place I would start is by adapting the following script for causal language modelling to your dataset: transformers/ at master · huggingface/transformers · GitHub

This script allows you to specify both the tokenizer and the model architecture, plus you can do multi-GPU training, which is advisable if you’re training from scratch.
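
Since the architecture is something you choose yourself when training from scratch, a quick size estimate can help decide how many GPUs are worth using. Here is a rough sketch, assuming the standard GPT-2 layout (learned position embeddings, a 4x MLP expansion, and an output layer tied to the token embeddings). The function name and signature are hypothetical; the argument names just mirror the usual GPT-2 config fields.

```python
def gpt2_param_count(vocab_size: int, n_positions: int, n_embd: int, n_layer: int) -> int:
    """Estimate trainable parameters of a GPT-2 style decoder (tied output head)."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d   # token table + position table
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # fused QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # expand to 4*d, then project back to d
    layer_norms = 2 * (2 * d)                       # two LayerNorms per block (scale + bias)
    per_block = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    return embeddings + n_layer * per_block + final_layer_norm


# Hyperparameters of the smallest released GPT-2, used only as a sanity check.
small = gpt2_param_count(vocab_size=50257, n_positions=1024, n_embd=768, n_layer=12)

# A smaller, made-up configuration with a custom 32k-token vocabulary.
custom = gpt2_param_count(vocab_size=32000, n_positions=512, n_embd=512, n_layer=6)

print(f"small: {small:,} parameters, custom: {custom:,} parameters")
```

Note that the vocabulary size of your own tokenizer feeds directly into the embedding table, so it has to match whatever vocabulary size the model config uses.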

Hope that helps!

Hi @lewtun ,
the repo link you shared is not available now; can you please share the updated one?

Hey @yubi-sanprit ,
I believe he referred to this script.

Thank you @IdoAmit198

@IdoAmit198 I need to load the trained GPT-2 model with the above script. Can you please share the related repo URL?

Hey @yubi-sanprit
You can specify to load a pretrained gpt2 by passing the flag --model_name_or_path with the value gpt2.
You can see some examples to run the script in the repo’s
You can also run the script I referred to with the flag --help alone to see more helpful information and options to use this script.

Thank you @IdoAmit198

@IdoAmit198 I have around 230GB of data. How can I pass this huge amount of data while training from scratch? Is there any way to pass it via lazy loading, i.e. only a chunk of data goes from disk to memory and is deleted once the model has been trained on that chunk?

Also, will it detect multiple GPUs, or will I need to specify something for this?

From my limited knowledge, I’d say that is more suitable for customisations such as lazy loading.

Regarding your big data, I think streaming would be a good option (load the dataset as an IterableDataset). You can read about it here. If you decide it fits your case, you can still use either of the scripts and just make your own changes to it. For example, when you call load_dataset() you should pass streaming=True, and verify that your code doesn’t rely on random access to the data (since it’s an iterable dataset).
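
To illustrate the idea behind streaming (only one chunk in memory at a time, no random access), here is a plain-Python sketch. It is not how the datasets library implements streaming; the helper name `iter_line_chunks` and the tiny demo file are made up for the example.

```python
import os
import tempfile
from typing import Iterator, List


def iter_line_chunks(path: str, lines_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of at most `lines_per_chunk` lines, reading the file lazily."""
    chunk: List[str] = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:  # a file object yields one line at a time
            chunk.append(line.rstrip("\n"))
            if len(chunk) == lines_per_chunk:
                yield chunk
                chunk = []  # start a fresh list so the old chunk can be freed
    if chunk:  # last, possibly shorter, chunk
        yield chunk


# Tiny demo file standing in for a much larger corpus.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("\n".join(f"sentence {i}" for i in range(5)))
    demo_path = tmp.name

# list() is used only to inspect the result; a training loop would iterate instead.
chunks = list(iter_line_chunks(demo_path, lines_per_chunk=2))
os.remove(demo_path)
print(chunks)
```

Because the data only ever moves forward, anything like `dataset[i]` or `len(dataset)` is not available on such an iterator, which is why code written for a regular dataset may need small changes.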

If you run the script as is, it shouldn’t detect multiple GPUs and use them. What you’re looking for is distributed training. There are a few ways to do that; I’ll list two I’m familiar with:

  1. torch.distributed.launch. If all your GPUs are in 1 node, you should run something like the following:

python -m torch.distributed.launch --nproc_per_node=$NUM_GPUS_YOU_HAVE YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)

In case you’re using multiple nodes, check out the link for a complete explanation.

  2. torchrun. Again, the correct command depends on whether your GPUs are in 1 node or in multiple nodes. For multiple nodes, read the link. For 1 node, as in my case, try the following:

torchrun --nproc_per_node=$NUM_GPUS_YOU_HAVE YOUR_TRAINING_SCRIPT.py (--arg1 … train script args…)
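
For context on what the launcher does: torchrun starts one process per GPU and tells each process who it is through environment variables (RANK, LOCAL_RANK, WORLD_SIZE). The older torch.distributed.launch may pass a --local_rank argument instead, depending on version and flags. A minimal sketch of reading those variables follows; the helper `read_launcher_env` is a made-up name, and the Trainer-based example scripts normally handle this for you.

```python
import os
from typing import Dict, Mapping


def read_launcher_env(environ: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Read the process identity set by the launcher, with single-process defaults."""
    return {
        "rank": int(environ.get("RANK", 0)),              # global index of this process
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # index within this node
        "world_size": int(environ.get("WORLD_SIZE", 1)),  # total number of processes
    }


# Simulate what the second of four processes on one node would see.
simulated = read_launcher_env({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "4"})

# Simulate a plain `python script.py` run, where none of the variables are set.
plain = read_launcher_env({})

print(simulated, plain)
```

If `world_size` comes out as 1 in your training run, the script was most likely started without a launcher and is using a single process.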

Hope it helps you buddy.

Thank you @IdoAmit198 @cerdwin

@IdoAmit198 @cerdwin

The line_by_line argument is not available in the PyTorch script, but it is there in the TensorFlow script.

I don’t want to merge sentences after tokenization. Is there any way to do this in PyTorch?
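
To make the difference concrete, here is a small sketch of the two preprocessing strategies on already-tokenized lines: concatenating everything and cutting fixed-size blocks, versus keeping one example per line. Both function names, the toy token IDs, and the pad ID of 0 are assumptions for illustration, not code taken from either script.

```python
from typing import List


def group_texts(tokenized_lines: List[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate all token IDs, then cut full blocks (the remainder is dropped)."""
    flat = [tok for line in tokenized_lines for tok in line]
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


def line_by_line(tokenized_lines: List[List[int]], block_size: int, pad_id: int) -> List[List[int]]:
    """Keep one example per line: truncate long lines, pad short ones."""
    examples = []
    for line in tokenized_lines:
        clipped = line[:block_size]
        examples.append(clipped + [pad_id] * (block_size - len(clipped)))
    return examples


# Three "sentences" as toy token IDs.
lines = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]

grouped = group_texts(lines, block_size=4)              # sentences get merged
separate = line_by_line(lines, block_size=4, pad_id=0)  # sentences stay apart

print(grouped)
print(separate)
```

With the line-by-line variant, the padded positions should be excluded from the loss (for example by setting their labels to -100, the ignore index used by the transformers loss), otherwise the model is also trained to predict padding.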

With the self-trained GPT-2, when I generate text it gives the following error, while it works with the pre-trained GPT-2:

from transformers import pipeline, set_seed
my_generator_2 = pipeline(task='text-generation', model='checkpoint-24000', tokenizer=gpt_tokenizer, framework='pt')

text = 'research paper'

“num_return_sequences has to be 1, but is 2 when doing greedy search.”
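
For reference, this message appears when more than one sequence is requested while generation is running in greedy mode (no sampling, a single beam): greedy decoding is deterministic, so it can only produce one result. Below is a simplified stand-in for that kind of argument check, written only to show which combinations are accepted. The function `check_generation_args` is hypothetical and is not the library's actual code.

```python
def check_generation_args(num_return_sequences: int = 1,
                          do_sample: bool = False,
                          num_beams: int = 1) -> None:
    """Reject argument combinations that cannot yield the requested number of outputs."""
    greedy = not do_sample and num_beams == 1
    if greedy and num_return_sequences > 1:
        raise ValueError(
            f"num_return_sequences has to be 1, but is {num_return_sequences} "
            "when doing greedy search."
        )
    if num_beams > 1 and num_return_sequences > num_beams:
        # Paraphrased message: beam search cannot return more sequences than beams.
        raise ValueError("num_return_sequences must not exceed num_beams.")


# Greedy decoding with two requested sequences reproduces the message.
try:
    check_generation_args(num_return_sequences=2)
    greedy_message = None
except ValueError as err:
    greedy_message = str(err)

# Either of these combinations passes the check.
check_generation_args(num_return_sequences=2, do_sample=True)
check_generation_args(num_return_sequences=2, num_beams=2)

print(greedy_message)
```

In the snippet above, that would mean passing do_sample=True (or num_beams=2) together with num_return_sequences=2 when calling my_generator_2. One plausible reason the pre-trained gpt2 checkpoint does not hit this is that its published config enables sampling for text generation, while a config created from scratch does not; that is an assumption worth checking against the config.json in your checkpoint folder.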

To be honest, I’m not familiar with that kind of error since I’m not really using pipeline.
I’d recommend opening a new post detailing what you’re trying to achieve and the error you get.
Good luck :)

Thank you @IdoAmit198.
Can you please look into this issue?