Hi,
I’m training a large GPT-2-based causal language model on multiple GPUs using PyTorch’s FullyShardedDataParallel (FSDP) strategy. I enabled FSDP in the Hugging Face Trainer by passing the following arguments:
"fsdp": "full_shard auto_wrap"
"fsdp_config": {
"fsdp_transformer_layer_cls_to_wrap": ["GPT2Block"]
}
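For context, here is roughly how these options end up in the TrainingArguments (a minimal sketch; "my_output_dir" and the batch size are placeholders, not my real values):

from transformers import TrainingArguments

# Minimal FSDP setup sketch; output_dir and batch size are placeholders.
training_args = TrainingArguments(
    output_dir="my_output_dir",
    per_device_train_batch_size=4,
    fsdp="full_shard auto_wrap",
    fsdp_config={
        "fsdp_transformer_layer_cls_to_wrap": ["GPT2Block"],
    },
)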
With FSDP, the model is sharded across multiple GPUs and trains successfully. Now I want to add an evaluation step to the trainer. I don’t just want to compute perplexity or an accuracy score from the argmax of each logit; I want to do an end-to-end evaluation by calling the model’s generate method and producing outputs autoregressively. I couldn’t figure out a way to call model.generate, or an equivalent method, in the evaluation step. Below is what I have tried.
To do this custom evaluation, I subclassed the Trainer class:
import torch
from transformers import Trainer

class CustomTrainer(Trainer):
    def evaluate(
        self,
        eval_dataset=None,
        ignore_keys=None,
        metric_key_prefix: str = "eval",
    ):
        # Only take one example for illustration.
        input_ids = torch.tensor(
            [self.eval_dataset[0]["input_ids"]]
        ).to(f"cuda:{self.args.local_rank}")
        output = self.model.generate(input_ids)
        return {"my_fancy_metric": 1.0}
If I don’t include the .to(f"cuda:{self.args.local_rank}") part, I get an error message saying:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking argument for argument index in method wrapper__index_select)
This is understandable, since the input_ids tensor is on the CPU while the model is sharded across different GPUs. But after adding .to(f"cuda:{self.args.local_rank}"), I got:
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
I also tried calling pipeline("text-generation") with a text input, but got the same behavior.
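The pipeline attempt looked roughly like this (a sketch, assuming the trainer was given a tokenizer; the prompt text is a placeholder):

from transformers import pipeline

# Sketch of the pipeline attempt inside evaluate(); the prompt is a placeholder.
generator = pipeline(
    "text-generation",
    model=self.model,
    tokenizer=self.tokenizer,
)
output = generator("Some prompt text")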
So how can I properly call the model.generate method in evaluation steps with the Trainer and FSDP?