How to train GPT-2 from scratch? (no fine-tuning)

Hi, I would like to train GPT-2 from scratch. I don’t want to fine-tune an existing model, but actually train it from scratch with my own tokenizer. How could I do it?

Thanks.

Hi @iamnotapenguin, the place I would start is by adapting the following script for causal language modelling to your dataset: transformers/run_clm.py at master · huggingface/transformers · GitHub

This script allows you to specify both the tokenizer and the model architecture, and it supports multi-GPU training, which is advisable if you’re training from scratch.
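
To make that concrete, here is a rough sketch of what a from-scratch invocation could look like (the tokenizer path, data file and output directory are just placeholders, not something from the repo):

python run_clm.py \
    --model_type gpt2 \
    --tokenizer_name ./my-tokenizer \
    --train_file ./train.txt \
    --do_train \
    --output_dir ./gpt2-from-scratch

Leaving out --model_name_or_path is what makes this a from-scratch run: the script then builds a fresh model from the GPT-2 config instead of loading pretrained weights.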

Hope that helps!


Hi @lewtun ,
the repo link you shared is no longer available; could you please share an updated one?

Hey @yubi-sanprit ,
I believe he referred to this script.


Thank you @IdoAmit198

@IdoAmit198 I need to load the trained GPT-2 model with the above script; could you please share the related repo URL?

Hey @yubi-sanprit
You can specify loading a pretrained gpt2 by passing the flag --model_name_or_path with the value gpt2.
You can see some examples of how to run the script in the repo’s README.md.
You can also run the script with the --help flag alone to see more information and the available options.
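
For example, the README shows something along these lines (the dataset names here are just the ones used there):

python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm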

Thank you @IdoAmit198

@IdoAmit198 I have around 230 GB of data. How can I pass this huge amount of data while training from scratch? Is there any way to pass it via lazy loading, i.e. only a chunk of data goes from disk to memory and is released once the model has been trained on that chunk?

Also, will it detect multiple GPUs, or will I need to specify something for this?

From my limited knowledge, I’d say that run_clm_no_trainer.py is more suitable for customisations such as lazy loading.

Regarding your big data, I think streaming would be a good option (load the dataset as an IterableDataset). You can read about it here. If you decide it fits your needs, you can still use the run_clm.py or run_clm_no_trainer.py scripts and just make your own changes to them. For example, when you call load_dataset() you should pass streaming=True, and you should verify that you don’t use random access on the data (since it’s an iterable dataset).
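
A minimal sketch of what that could look like, assuming your corpus is in plain-text files (the file name is a placeholder):

from datasets import load_dataset

# streaming=True returns an IterableDataset: examples are read lazily from disk
# instead of being loaded into memory (or cached) all at once.
streamed = load_dataset("text", data_files={"train": "train.txt"}, streaming=True)

# Iterable datasets only support sequential access, e.g.:
for example in streamed["train"].take(5):
    print(example["text"])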

If you run the script as-is, it shouldn’t detect and use multiple GPUs. What you’re looking for is distributed training. There are a few ways to do that; I’ll list two I’m familiar with:

  1. torch.distributed.launch. In case all your GPUs are in 1 node, you should run something like the following:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE \
    YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)

In case you’re using multiple nodes, check out the link for a complete explanation.

  2. torchrun. Again, the correct command depends on whether your GPUs are in 1 node or in multiple nodes. In the case of multiple nodes, read the link. In the case of 1 node like me, try the following:

torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=$NUM_GPUS_YOU_HAVE \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... and all other arguments of your training script)

Hope it helps you buddy.

Thank you @IdoAmit198 @cerdwin

@IdoAmit198 @cerdwin

The line_by_line argument is not available in the PyTorch run_clm.py script, but it is there in the TensorFlow script.

I don’t want to merge sentences after tokenization; is there any way to do this in PyTorch?

@IdoAmit198
with the self-trained GPT-2, when I am generating text, it gives the following error, while it works with the pre-trained GPT-2:

from transformers import pipeline, set_seed
my_generator_2 = pipeline(task='text-generation', model='checkpoint-24000', tokenizer=gpt_tokenizer, framework='pt')
set_seed(42)

text = 'research paper'
my_generator_2(text.lower(), max_length=250, num_return_sequences=2)

“num_return_sequences has to be 1, but is 2 when doing greedy search.”

To be honest I’m not familiar with that kind of error, since I don’t really use pipeline.
I’d recommend opening a new post detailing what you’re trying to achieve and the error you get.
Good luck :slight_smile:

Thank you @IdoAmit198.
Can you please look into this issue?

In answer to my question about big data size and lazy loading:
The Datasets library’s DatasetDict format and its map method, which can call any function such as tokenization and grouping, are designed to run in batches. They handle big data by processing it batch by batch. So, to work with big data of any size, convert your dataset into the DatasetDict format and use the map method.
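
A minimal sketch of that approach, assuming a self-trained GPT-2 tokenizer and plain-text files (the paths and file names are placeholders):

from datasets import load_dataset
from transformers import AutoTokenizer

gpt_tokenizer = AutoTokenizer.from_pretrained("./my-gpt2-tokenizer")
raw_datasets = load_dataset("text", data_files={"train": "train.txt"})

def tokenize_function(examples):
    return gpt_tokenizer(examples["text"])

# map() walks over the dataset in batches, so the full 230 GB never has to sit
# in memory at once; remove_columns drops the raw text after tokenization.
tokenized = raw_datasets.map(tokenize_function, batched=True, remove_columns=["text"])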

@IdoAmit198 , In reply to this question:

with the self-trained GPT-2, when I am generating text, it gives the following error, while it works with the pre-trained GPT-2:

from transformers import pipeline, set_seed
my_generator_2 = pipeline(task='text-generation', model='checkpoint-24000', tokenizer=gpt_tokenizer, framework='pt')
set_seed(42)

text = 'research paper'
my_generator_2(text.lower(), max_length=250, num_return_sequences=2)

“num_return_sequences has to be 1, but is 2 when doing greedy search.”

The Transformers built-in pipeline for text generation gives this error, but the following method works well and returns more than one sequence (note that it passes do_sample=True, so generation is no longer greedy):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("checkpoint-80000")

def text_generator(prompt):
    prompt = prompt.lower()
    # gpt_tokenizer is the self-trained tokenizer loaded earlier
    input_ids = gpt_tokenizer(prompt, return_tensors="pt").input_ids
    # num_return_sequences > 1 is allowed here because do_sample=True;
    # note that top_p is meant to be a float in (0, 1), so a value like 25 effectively disables nucleus sampling
    outputs = model.generate(input_ids, do_sample=True, max_length=150, min_length=100, temperature=1.5, num_return_sequences=10, top_k=50, top_p=25)
    output_text = gpt_tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return output_text
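
With the function above, a call would look something like this (the prompt is just an example):

samples = text_generator("research paper")
for i, text in enumerate(samples):
    print(f"--- sample {i} ---\n{text}")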