Finetuning GPT2 using Multiple GPUs and Trainer

I’m finetuning GPT2 on my corpus for text generation, and I’m using the Trainer class to handle the training. I have multiple GPUs available to me. As I understand from the documentation and forums, if I wanted to utilize these multiple GPUs for training in Trainer, I would set the no_cuda parameter to False (which it is by default). Is there anything else that needs to be done in order to utilize these GPUs in Trainer for training?

Thanks in advance!

Trainer starts training on multiple GPUs automatically if they are available. You can control which GPUs to use with the CUDA_VISIBLE_DEVICES environment variable, i.e. if CUDA_VISIBLE_DEVICES=1,2 then it’ll use CUDA devices 1 and 2. Pinging @sgugger for more info.
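If you want to sanity-check which devices will be visible, something like the following should work (a minimal sketch; note that CUDA_VISIBLE_DEVICES has to be set before torch initializes CUDA, so set it before the import):

import os
# restrict training to CUDA devices 1 and 2; must happen before torch touches CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

import torch
print(torch.cuda.device_count())  # should print 2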


@valhalla and this is why HF is awesome! Thanks for the response.

Is it possible to see 1) whether HF is using the CPU or a GPU and 2) which GPUs HF is using? That’s just a bit of information I’d like to add to my logs while training.

If you look at TrainingArguments.n_gpu, it will give you the number of GPUs being used.
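For logging, a small helper along these lines could record the setup (a sketch; log_device_info is my own name, but n_gpu and device are real TrainingArguments attributes):

import torch

def log_device_info(training_args):
    # device and n_gpu are populated when TrainingArguments is instantiated
    print(f"Device: {training_args.device}, number of GPUs: {training_args.n_gpu}")
    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            print(f"GPU {i}: {torch.cuda.get_device_name(i)}")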

@valhalla
I encountered this error while trying to utilize multiple GPUs for training:

Traceback (most recent call last):
  File "run_finetune_gpt2.py", line 158, in <module>
    main()
  File "run_finetune_gpt2.py", line 145, in main
    trainer.train()
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
    outputs = model(**inputs)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/path/to/venvs/my-venv/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
    assert all(map(lambda i: i.is_cuda, inputs))
AssertionError
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb: You can sync this run to the cloud by running:
wandb: wandb sync wandb/dryrun-20200914_134757-1sih3p0q

Any thoughts about how to correct it?

Could you post the command/script snippet that you used to launch training?

@valhalla, happy to. Here is the snippet:

from transformers import GPT2Tokenizer, TrainingArguments, Trainer
import torch
from torch.utils.data import Dataset
import pandas as pd  # needed by MyDataset below

class MyDataset(Dataset):
    def __init__(self, csv_file: str):
        self.df = pd.read_csv(csv_file, encoding='ISO-8859-1')

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        text = self.df.iloc[idx, 1]
        return text



def my_data_collator(dataset_samples_list):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
    tokenizer.pad_token = tokenizer.eos_token

    encoded_results = tokenizer(dataset_samples_list, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

    batch = {}
    batch['input_ids'] = torch.stack([result for result in encoded_results['input_ids']])
    batch['past'] = None
    batch['attention_mask'] = torch.stack([result for result in encoded_results['attention_mask']])
    batch['position_ids'] = None
    batch['head_mask'] = None
    batch['inputs_embeds'] = None
    batch['labels'] = None
    batch['use_cache'] = True
    return batch


dataset_train = MyDataset('/path/to/train_dataset.csv')

training_args = TrainingArguments(
    output_dir='/path/to/out',
    do_train=True,
    per_device_train_batch_size=64,
    logging_dir='/path/to/dir', 
    max_steps=300000
)

model = GPT2FinetunedWithNgrams.from_pretrained('gpt2')

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=my_data_collator,
    train_dataset=dataset_train
)
trainer.train()
trainer.save_model('/path/to/model_save_dir')

I’m working on getting the snippet together for model = GPT2FinetunedWithNgrams.from_pretrained('gpt2') so others can see how the loss etc. are being calculated. The above is the controller script for the training.

@valhalla Here is the full code snippet:

from transformers import GPT2Tokenizer, GPT2LMHeadModel, TrainingArguments, Trainer
import torch
from torch.utils.data import Dataset
import sys
import pandas as pd
#import numpy as np

ZERO = sys.float_info.min
ZERO_PT = torch.tensor(ZERO)

class GPT2FinetunedWithNgrams(GPT2LMHeadModel):
    def __init__(self, config, model_tokenizer=None):
        super().__init__(config)
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def eval_sentence(self, sent: str):
        # sentence_vec removes punct, lower-cases, splits on space, and adds
        # "<s>"/"</s>" start and stop tokens; it returns a list of strings
        vec = self.sentence_vec(sent)
        last_idx = min(self.max_ngram, len(vec))

        log_prob = 0
        for i in range(2, last_idx + 1):
            #log_prob += np.log(max(ZERO, self.pkatz(vec[0:i])))  # conditional probability with katz backoff
            log_prob += torch.log(max(ZERO_PT, self.pkatz(vec[0:i])))

        for i in range(1, len(vec) - last_idx + 1):
            j = i + last_idx
            #log_prob += np.log(max(ZERO, self.pkatz(vec[i:j])))
            log_prob += torch.log(max(ZERO_PT, self.pkatz(vec[i:j])))
        return log_prob, len(vec)

    def sentence_loss(self, sent: str):
        p, l = self.eval_sentence(sent)
        return -p

    def generate_text_while_finetuning(self,
                                       input_ids=None,
                                       past=None,
                                       attention_mask=None,
                                       token_type_ids=None,
                                       position_ids=None,
                                       head_mask=None,
                                       inputs_embeds=None,
                                       labels=None,
                                       use_cache=None,
                                       output_attentions=None,
                                       output_hidden_states=None, ):
        transformer_outputs = self.transformer(
            input_ids,
            past=past,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
        )
        hidden_states = transformer_outputs[0]
        lm_logits = self.lm_head(hidden_states)
        outputs = (lm_logits,) + transformer_outputs[1:]
        return outputs  # (loss), lm_logits, presents, (all hidden_states), (attentions)

    def forward(
            self,
            input_ids=None,
            past=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            labels=None,
            use_cache=True,
    ):

        max_length = input_ids.shape[1] + 50
        full_generated_gpt2_ids = self.generate(input_ids=input_ids,
                                                max_length=max_length,
                                                is_finetuning_current_model=True,
                                                attention_mask=attention_mask,
                                                pad_token_id=50256,
                                                do_sample=True,
                                                top_k=50,
                                                top_p=0.95)

        decoded_gen_samples = self.tokenizer.batch_decode(full_generated_gpt2_ids, skip_special_tokens=True)
        tmp_losses = [self.sentence_loss(decoded_sample) for decoded_sample in decoded_gen_samples]
        losses = torch.stack(tmp_losses)
        loss = losses.mean()
        loss.requires_grad = True
        return (loss,)


## The code below is the run script.
class MyDataset(Dataset):
    def __init__(self, csv_file: str):
        self.df = pd.read_csv(csv_file, encoding='ISO-8859-1')

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        text = self.df.iloc[idx, 1]
        return text

def my_data_collator(dataset_samples_list):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
    tokenizer.pad_token = tokenizer.eos_token

    encoded_results = tokenizer(dataset_samples_list, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

    batch = {}
    batch['input_ids'] = torch.stack([result for result in encoded_results['input_ids']])
    batch['past'] = None
    batch['attention_mask'] = torch.stack([result for result in encoded_results['attention_mask']])
    batch['position_ids'] = None
    batch['head_mask'] = None
    batch['inputs_embeds'] = None
    batch['labels'] = None
    batch['use_cache'] = True
    return batch

dataset_train = MyDataset('/path/to/train_dataset.csv')

training_args = TrainingArguments(
    output_dir='/path/to/out',
    do_train=True,
    per_device_train_batch_size=64,
    logging_dir='/path/to/dir',
    max_steps=300000
)

model = GPT2FinetunedWithNgrams.from_pretrained('gpt2')

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=my_data_collator,
    train_dataset=dataset_train
)
trainer.train()
trainer.save_model('/path/to/model_save_dir')
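One thing I’m still checking on my end (an assumption on my part, not a confirmed diagnosis): under torch.nn.DataParallel, every tensor returned from forward is gathered back onto the primary device, and that gather asserts each one is a CUDA tensor. ZERO_PT above is created on the CPU at import time, so the loss built from it may never land on a GPU. A sketch of a device-safe helper I’m considering:

import sys
import torch

def safe_log(p: torch.Tensor) -> torch.Tensor:
    # build the log floor on the same device as the probability tensor,
    # so the loss returned from forward() stays on the replica's GPU
    zero_pt = torch.tensor(sys.float_info.min, device=p.device)
    return torch.log(torch.max(zero_pt, p))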

@aclifton314 Hi, sorry to bother you. I am trying to train and evaluate my GPT-2 by applying the Trainer with a GPU, and I am not sure how I can pass my model and the training and evaluation data to the GPU in this form. I would appreciate your ideas.
my code is:

model = AutoModel.from_pretrained(“”).to(“cuda”)

training_args = TrainingArguments(
output_dir=“./gpt2-gerchef”, #The output directory
overwrite_output_dir=True, #overwrite the content of the output directory
num_train_epochs=3, # number of training epochs
per_device_train_batch_size=32, # batch size for training
per_device_eval_batch_size=64, # batch size for evaluation
eval_steps = 400, # Number of update steps between two evaluations.
save_steps=800, # after # steps model is saved
warmup_steps=500,# number of warmup steps for learning rate scheduler
prediction_loss_only=True,
)

trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=train_dataset,
eval_dataset=test_dataset,
)

@SUNM It is my understanding that if GPUs are available, then Trainer will use them. One thing we can do to check this is to keep your code as it is written above and try to run it on a very small sample of the train_dataset and eval_dataset, something like 100 examples for the train and maybe 20 or 50 for the eval. I wouldn’t worry about what the model evaluates to; we just want to see whether it is performing the training and evaluation on the GPU.

One straightforward way to do this is to run your training and evaluation in one command-line tab, open a second tab, and repeatedly run nvidia-smi to see whether GPU utilization is happening during training and evaluation. Alternatively, if you have it, keep nvtop open in that second tab; it updates continuously on its own, so you don’t have to keep re-running the command.
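If you’d rather capture this in your training logs than watch a terminal, a callback might work. Here is a minimal sketch (the class name and the fields logged are my own, not an official recipe):

import torch
from transformers import TrainerCallback

class GpuUsageCallback(TrainerCallback):
    """Print per-GPU memory use whenever the Trainer logs metrics."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        if torch.cuda.is_available():
            for i in range(torch.cuda.device_count()):
                mb = torch.cuda.memory_allocated(i) / 1024 ** 2
                print(f"GPU {i}: {mb:.0f} MiB allocated")

You would pass it in via Trainer(..., callbacks=[GpuUsageCallback()]).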

I would say start there and report back what you find and we can go from there. There might be a better way to check if @valhalla or @sgugger want to chime in.

Lastly, while the code snippet is fairly straightforward to read and understand, it helps those wishing to respond if the code is surrounded by tick marks. For instance, instead of:
model = AutoModel.from_pretrained(“”).to(“cuda”)

training_args = TrainingArguments(
output_dir=“./gpt2-gerchef”, #The output directory
overwrite_output_dir=True, #overwrite the content of the output directory
num_train_epochs=3, # number of training epochs
per_device_train_batch_size=32, # batch size for training
per_device_eval_batch_size=64, # batch size for evaluation
eval_steps = 400, # Number of update steps between two evaluations.
save_steps=800, # after # steps model is saved
warmup_steps=500,# number of warmup steps for learning rate scheduler
prediction_loss_only=True,
)

trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=train_dataset,
eval_dataset=test_dataset,
)

You can place three tick marks at the top and bottom of the code block to format it into:

model = AutoModel.from_pretrained("").to("cuda")

training_args = TrainingArguments(
    output_dir="./gpt2-gerchef",      # the output directory
    overwrite_output_dir=True,        # overwrite the content of the output directory
    num_train_epochs=3,               # number of training epochs
    per_device_train_batch_size=32,   # batch size for training
    per_device_eval_batch_size=64,    # batch size for evaluation
    eval_steps=400,                   # number of update steps between two evaluations
    save_steps=800,                   # after # steps model is saved
    warmup_steps=500,                 # number of warmup steps for learning rate scheduler
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

which makes it easier to dissect and read. Here’s a short read about it.

Happy to help out. Let me know what you find.

@aclifton314 many thanks

@valhalla, I hope you are well. I ran the code with the command below and then checked nvidia-smi; all of the GPUs I specified (1, 2, and 3) are working. Is that OK now, or should I do anything more inside the code? This is my code. Can I trust the final model? And how did you call your function?

CUDA_VISIBLE_DEVICES="1,2,3" python casesummary_resolution_GPT_Neo_GPU_V5-125M-Trainer_v22.py

from torch.utils.data import Dataset, random_split
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelForCausalLM, IntervalStrategy
from sklearn.model_selection import train_test_split
import torch  # needed for torch.manual_seed and the collator below

torch.manual_seed(42)

pretrained_model = '/home//GPT-NEO-125M/'

tokenizer = AutoTokenizer.from_pretrained(pretrained_model, bos_token='<|startoftext|>',
                                          eos_token='<|endoftext|>', pad_token='<|pad|>')
model = AutoModelForCausalLM.from_pretrained(pretrained_model).cuda()

print(torch.cuda.current_device())


model.resize_token_embeddings(len(tokenizer))

descriptions = DataWhole_1

# max_length = max([len(tokenizer.encode(description)) for description in descriptions])

max_length=350
print("Max length: {}".format(max_length))


class NetflixDataset(Dataset):
    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for txt in txt_list:
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>', truncation=True,
                                       max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]


dataset = NetflixDataset(descriptions, tokenizer, max_length=max_length)


train_dataset, val_dataset = train_test_split(dataset,test_size=.1,random_state=42,shuffle=False)

training_args = TrainingArguments(
    output_dir=Results_Path,
    learning_rate=5e-5,
    num_train_epochs=16,
    evaluation_strategy="steps",
    logging_strategy="steps",
    save_strategy="steps",
    save_steps=10000,
    seed=42,
    load_best_model_at_end=True,
    logging_steps=10000,
    report_to="tensorboard",
    per_device_train_batch_size=4,
    eval_steps=10000,
    save_total_limit=2,
    per_device_eval_batch_size=4,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir=Results_Path,
)


trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=lambda data: {
        'input_ids': torch.stack([f[0] for f in data]),
        'attention_mask': torch.stack([f[1] for f in data]),
        'labels': torch.stack([f[0] for f in data]),
    },
)
trainer.train()


@aclifton314, I hope you are well. Same question as above: I ran the code with the command shown and checked nvidia-smi, and all of the GPUs I specified (1, 2, and 3) are working. Should I do anything more inside the code, and can I trust the final model?


@SUNM I think that should train your model.

can I trust the final model?

One should never simply trust the final model. One should always evaluate and inspect the model to build confidence that it is doing what the developer intended. There are a few ways to do this, and the approach can vary across applications.

The first thing I would check is whether or not the parameters of the model are being updated as training proceeds. You can print out the model parameters before and after the Trainer.train() call to see whether they have changed. You don’t need to inspect each and every parameter; just look at a subset and see if they differ.
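As a sketch (assuming a GPT-2 style model so that model.transformer.h[0] exists; any parameter would do):

import torch

# snapshot one weight tensor before training, then compare it afterwards
before = model.transformer.h[0].attn.c_attn.weight.detach().cpu().clone()
trainer.train()
after = model.transformer.h[0].attn.c_attn.weight.detach().cpu()
print("parameters changed:", not torch.equal(before, after))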

The next thing I would make sure to have is the loss plot (also known as the learning curve). This is a plot of the model’s loss at each training step. Here is a detailed article describing it: https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/. What this tells you is whether or not your model is actually optimizing against the loss function. If the curve decreases over the training steps, one can say that the model’s predictions are getting better and better with respect to the training data and their labels.
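Trainer keeps everything it logs in trainer.state.log_history, so a quick plot after training might look like this (a sketch; it assumes matplotlib is available and that logging_steps was set low enough for training losses to actually be recorded):

import matplotlib.pyplot as plt

# training-loss entries in log_history carry both "loss" and "step" keys
history = trainer.state.log_history
steps = [h["step"] for h in history if "loss" in h]
losses = [h["loss"] for h in history if "loss" in h]
plt.plot(steps, losses)
plt.xlabel("training step")
plt.ylabel("training loss")
plt.savefig("learning_curve.png")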

The last thing I would check is whether the model is making predictions that beat majority-class prediction or random guessing. This is more of an evaluation method, letting you say the model has learned something beyond predicting the majority class or guessing randomly. The DummyClassifier class in scikit-learn makes this evaluation easy: sklearn.dummy.DummyClassifier — scikit-learn 1.3.2 documentation
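For a classification task, that check might look like this (a sketch; X_train, y_train, X_test, and y_test are placeholders for your own data split):

from sklearn.dummy import DummyClassifier

# a majority-class baseline: any real model should beat this score
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))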

For causal language modeling, a metric that may be of interest is perplexity. This gives you a number you can use to evaluate the “goodness” of your model.
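Since the Trainer’s eval loss for causal language modeling is a mean cross-entropy, perplexity is just its exponential. A sketch:

import math

# trainer.evaluate() returns a metrics dict including "eval_loss"
eval_metrics = trainer.evaluate()
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"perplexity: {perplexity:.2f}")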

how did you call your function?

I’m not sure what you mean here. If you’re referring to how to begin training the model, it is as you have written in your code: namely, by calling the Trainer.train() method.

If you are referring to how to load your trained model for inference, I can’t recall exactly how that’s done, but I know there are several posts on this forum that describe it. I would search for something like “load and call trained model” or “using a finetuned model for inference”.

I hope this helps.


@aclifton314 many thanks for your reply. Sorry, my results from multiple GPUs using the Trainer API are very strange in comparison with using 1 GPU. Would you please share a sample of code where you used multiple GPUs and the results were reasonable? Many thanks.