How to load checkpoint shards with Gaudi instead of CPU?

While fine-tuning Llama 2 70B, we ran into memory issues without distributed computing. It’s somehow expecting a node address, and we can’t pass the node address as an environment variable.

So we tried to accelerate the CPUs by using the HPUs, but while doing that we failed to load the checkpoint shards. How can we load checkpoint shards with Gaudi instead of the CPU? We need to accelerate our CPUs with HPUs and load the checkpoint shards onto the device.

@regis sorry for the barrage of questions! :slight_smile:

@gildesh

It’s somehow expecting a node address, and we can’t pass the node address as an environment variable.

This should only be needed if you want to use several nodes. If you have a single instance (8 devices), there is no need to specify any node address.
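To illustrate why no node address is needed on a single instance, here is a minimal sketch. It assumes the launcher relies on PyTorch's standard environment-variable rendezvous (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`, `LOCAL_RANK`). The helper `single_node_env` is hypothetical and only shows which values a single-node run would end up with. In practice the launcher normally sets these for you.

```python
def single_node_env(rank: int, world_size: int = 8, port: int = 29500) -> dict:
    """Hypothetical helper: env vars for one process on a single node.

    Assumes PyTorch's env-based rendezvous. On one node every process
    can reach the rendezvous at localhost, so no external node address
    is required.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size}), got {rank}")
    return {
        "MASTER_ADDR": "127.0.0.1",   # localhost: all processes share the node
        "MASTER_PORT": str(port),     # any free port; 29500 is an assumed choice
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
        "LOCAL_RANK": str(rank),      # equals RANK when there is only one node
    }


if __name__ == "__main__":
    # One entry per device on an 8-device instance.
    for rank in range(8):
        print(single_node_env(rank))
```

With several nodes, `MASTER_ADDR` would have to point to a host reachable from every node, which is the case where an address must be supplied.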

Loading checkpoint shards should work with DeepSpeed; I'm not sure whether it works without it. Could you share the command you ran so that I can reproduce the issue?