Multinode DeepSpeed T5 Experiment Issues with Hf-Trainer

I am training a T5 LM on 4 p3.16xlarge AWS nodes (8 V100s per node). I'm quite new to multi-node experiments, so I'm not sure whether this is an issue with hf-trainer or I'm missing something.

Some logging issues:

  1. Surprisingly, when I run my code (with hf-trainer) on a single node, the terminal successfully shows the tqdm progress bar. But when I run a multi-node experiment with DeepSpeed, the head node's terminal doesn't print the tqdm progress bar (disable_tqdm is always set to False). I can only see it in the wandb web UI, which often freezes, so I can't tell whether my training is frozen because of some weird ZeRO-3 config or whether wandb itself is frozen.
  2. I tried to redirect the transformers logger to my custom logger in the way mentioned here, but it didn't work: it could not capture the DeepSpeed and tqdm (progress bar) logging. Basically, what I want to do is redirect each node's logging to a different physical file (using a FileHandler), as sketched below.
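Roughly, what I'm after is something along these lines (a minimal sketch; the file names, the RANK handling, and the "DeepSpeed" logger name are my assumptions):

```python
import logging
import os

import transformers

# RANK is set per process by the deepspeed / torch distributed launcher.
rank = int(os.environ.get("RANK", 0))

# One physical log file per process (hypothetical naming scheme).
handler = logging.FileHandler(f"train_rank{rank}.log")
handler.setFormatter(
    logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s")
)

# Attach the handler to the transformers root logger.
transformers.utils.logging.set_verbosity_info()
transformers.utils.logging.get_logger().addHandler(handler)

# DeepSpeed logs through its own logger (named "DeepSpeed", as far as I can tell),
# so it needs its own handler; tqdm writes straight to stderr and is not captured
# by logging handlers at all.
logging.getLogger("DeepSpeed").addHandler(handler)
```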

DeepSpeed-related issues:

  1. If I recall correctly, right now DeepSpeed ZeRO-3 doesn't support PyTorch checkpoints. Is there any hack/script to evaluate the model on my preferred dataset after training has finished? I can see that the PyTorch checkpoint file is very small and the main data is in the global* folder, in files tied to specific devices (dist_world_size x 2 files per checkpoint in total). How can I load these checkpoints? Looking at the checkpoint files, this seems like a big deal, because these checkpoints may not be loadable with a different dist_world_size.
  2. Is there any way I can use bfloat16 on V100 GPUs, or is bfloat16 only supported on Ampere GPUs?

@stas @sgugger I hope you guys don't mind me tagging you here.

Hi @sbmaruf,

It's best to open an Issue for such reports, so that they are easier to track. And please open each issue separately.

And with any Issue, please report the environment, a simple reproducible script, the launch command, and the log, as I can't tell from your description alone exactly what you're doing, which I need in order to help you.

If I recall correctly, right now DeepSpeed ZeRO-3 doesn't support PyTorch checkpoints.

It always has.

You're talking about the intermediate checkpoint that DeepSpeed works with, which is indeed sharded so that it's fast to save and load, and that checkpoint is indeed hardwired to the topology you trained on.

If you don't mind losing the optimizer states, you can extract the fp32 weights back into a regular PyTorch checkpoint and resume on a different topology. There should be a zero_to_fp32.py script in the checkpoint folder that will do that for you.
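A rough sketch of both routes (the paths and the model class are placeholders for your setup):

```python
# Option 1: from the command line, inside the saved checkpoint folder
# (DeepSpeed auto-generates the script there):
#   python zero_to_fp32.py . pytorch_model.bin
#
# Option 2: from Python, consolidate the shards and load them into the model:
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")  # same arch/config as training
model = load_state_dict_from_zero_checkpoint(model, "path/to/checkpoint-folder")
model.eval()  # ready for evaluation on any topology; the optimizer states are gone
```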

Work has started on a universal checkpoint, which can be reshaped to any topology without losing optimizer states. It currently supports only ZeRO-1/bf16 (in the BigScience fork of Megatron-DeepSpeed); the rest of the ZeRO stages aren't supported yet. This work is driven by need: we needed it for the BLOOM training, so we implemented it. Someone who needs it for the other ZeRO stages should open a feature request on the DeepSpeed GitHub and work with Tunji to port the existing support to the other stages.

Is there any way I can use bfloat16 on V100 GPUs, or is bfloat16 only supported on Ampere GPUs?

That's correct: Ampere and higher provide hardware support for bfloat16.
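A quick way to check (minimal sketch):

```python
import torch

# Hardware bf16 requires compute capability >= 8.0 (Ampere or newer).
major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")  # V100 reports 7.0, A100 reports 8.0
print("hardware bf16:", major >= 8)

# On V100, stick to fp16, e.g. TrainingArguments(..., fp16=True) and the
# "fp16" (not "bf16") section of the DeepSpeed config.
```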

There is also IPEX (Intel Extension for PyTorch), which supports bfloat16 on recent CPUs (this is integrated into the training).
