How to train GPT-2 from scratch? (no fine-tuning)

Hi, I would like to train GPT-2 from scratch. I don’t want to fine-tune an existing model, but actually train it from scratch with my own tokenizer. How could I do it?

Thanks.

Hi @iamnotapenguin, the place I would start is by adapting the following script for causal language modelling to your dataset: transformers/run_clm.py at master · huggingface/transformers · GitHub

This script allows you to specify both the tokenizer and the model architecture, and it supports multi-GPU training, which is advisable if you’re training from scratch.
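
To make that concrete, here is a rough sketch of what a from-scratch invocation could look like (the tokenizer path, data file and output directory are just placeholders, not something from the repo):

python run_clm.py \
    --model_type gpt2 \
    --tokenizer_name ./my-tokenizer \
    --train_file ./train.txt \
    --do_train \
    --output_dir ./gpt2-from-scratch

Leaving out --model_name_or_path is what makes this a from-scratch run: the script then builds a fresh model from the GPT-2 config instead of loading pretrained weights.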

Hope that helps!


Hi @lewtun ,
the repo link you shared is no longer available; could you please share an updated one?

Hey @yubi-sanprit ,
I believe he referred to this script.


Thank you @IdoAmit198

@IdoAmit198 I need to load the trained GPT-2 model with the above script; could you please share the related repo URL?

Hey @yubi-sanprit
You can specify loading a pretrained gpt2 by passing the flag --model_name_or_path with the value gpt2.
You can see some examples of how to run the script in the repo’s README.md.
You can also run the script with the --help flag alone to see more information and the available options.
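
For example, the README shows something along these lines (the dataset names here are just the ones used there):

python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm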

Thank you @IdoAmit198

@IdoAmit198 I have around 230 GB of data. How can I pass this huge amount of data while training from scratch? Is there any way to pass it via lazy loading, i.e. only a chunk of data goes from disk to memory and is released once the model has been trained on that chunk?

Also, will it detect multiple GPUs, or will I need to specify something for this?

From my limited knowledge, I’d say that run_clm_no_trainer.py is more suitable for customisations such as lazy loading.

Regarding your big data, I think streaming would be a good option (load the dataset as an IterableDataset). You can read about it here. If you decide it fits your needs, you can still use the run_clm.py or run_clm_no_trainer.py scripts and just make your own changes to them. For example, when you call load_dataset() you should pass streaming=True, and you should verify that you don’t use random access on the data (since it’s an iterable dataset).
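
A minimal sketch of what that could look like, assuming your corpus is in plain-text files (the file name is a placeholder):

from datasets import load_dataset

# streaming=True returns an IterableDataset: examples are read lazily from disk
# instead of being loaded into memory (or cached) all at once.
streamed = load_dataset("text", data_files={"train": "train.txt"}, streaming=True)

# Iterable datasets only support sequential access, e.g.:
for example in streamed["train"].take(5):
    print(example["text"])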

If you run the script as-is, it shouldn’t detect and use multiple GPUs. What you’re looking for is distributed training. There are a few ways to do that; I’ll list two I’m familiar with:

  1. torch.distributed.launch. In case all your GPUs are in 1 node, you should run something like the following:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
    YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)

In case you’re using multiple nodes, check out the link for a complete explanation.

  2. torchrun. Again, the correct command depends on whether your GPUs are in 1 node or in multiple nodes. In the case of multiple nodes, read the link. In the case of 1 node like me, try the following:

torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=$NUM_GPUS_YOU_HAVE \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... and all other arguments of your training script)

Hope it helps you buddy.

Thank you @IdoAmit198 @cerdwin

@IdoAmit198 @cerdwin

The line_by_line argument is not available in the PyTorch run_clm.py script, but it is there in the TensorFlow script.

I don’t want to merge sentences after tokenization; is there any way to do this in PyTorch?

@IdoAmit198
with the self-trained GPT-2, when I am generating text, it gives the following error, while it works with the pre-trained GPT-2:

from transformers import pipeline, set_seed
my_generator_2 = pipeline(task='text-generation', model='checkpoint-24000', tokenizer=gpt_tokenizer, framework='pt')
set_seed(42)

text = 'research paper'
my_generator_2(text.lower(), max_length=250, num_return_sequences=2)

“num_return_sequences has to be 1, but is 2 when doing greedy search.”

To be honest I’m not familiar with that kind of error, since I don’t really use pipeline.
I’d recommend opening a new post detailing what you’re trying to achieve and the error you get.
Good luck :slight_smile:

Thank you @IdoAmit198.
Can you please look into this issue?

In answer to my question about big data size and lazy loading:
The Datasets library’s DatasetDict format and its map method, which can call any function such as tokenization and grouping, are designed to run in batches. They handle big data by processing it batch by batch. So, to work with big data of any size, convert your dataset into the DatasetDict format and use the map method.
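
A minimal sketch of that approach, assuming a self-trained GPT-2 tokenizer and plain-text files (the paths and file names are placeholders):

from datasets import load_dataset
from transformers import AutoTokenizer

gpt_tokenizer = AutoTokenizer.from_pretrained("./my-gpt2-tokenizer")
raw_datasets = load_dataset("text", data_files={"train": "train.txt"})

def tokenize_function(examples):
    return gpt_tokenizer(examples["text"])

# map() walks over the dataset in batches, so the full 230 GB never has to sit
# in memory at once; remove_columns drops the raw text after tokenization.
tokenized = raw_datasets.map(tokenize_function, batched=True, remove_columns=["text"])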

@IdoAmit198 , In reply to this question:

with the self-trained GPT-2, when I am generating text, it gives the following error, while it works with the pre-trained GPT-2:

from transformers import pipeline, set_seed
my_generator_2 = pipeline(task='text-generation', model='checkpoint-24000', tokenizer=gpt_tokenizer, framework='pt')
set_seed(42)

text = 'research paper'
my_generator_2(text.lower(), max_length=250, num_return_sequences=2)

“num_return_sequences has to be 1, but is 2 when doing greedy search.”

The Transformers built-in pipeline for text generation gives this error, but the following method works well and returns more than one sequence (note that it passes do_sample=True, so generation is no longer greedy):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("checkpoint-80000")

def text_generator(prompt):
    prompt = prompt.lower()
    # gpt_tokenizer is the self-trained tokenizer loaded earlier
    input_ids = gpt_tokenizer(prompt, return_tensors="pt").input_ids
    # num_return_sequences > 1 is allowed here because do_sample=True;
    # note that top_p is meant to be a float in (0, 1), so a value like 25 effectively disables nucleus sampling
    outputs = model.generate(input_ids, do_sample=True, max_length=150, min_length=100, temperature=1.5, num_return_sequences=10, top_k=50, top_p=25)
    output_text = gpt_tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return output_text
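
With the function above, a call would look something like this (the prompt is just an example):

samples = text_generator("research paper")
for i, text in enumerate(samples):
    print(f"--- sample {i} ---\n{text}")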