Best way to use Accelerate for large embeddings

What is the best way to use Accelerate to train huge embedding matrices?

How can we effectively split them across multiple devices and initialize them with real (non-empty) weights? At each step I want to look up a batch of embeddings and move only that batch to the GPU. Would it be best to just initialize the whole matrix on the CPU? Or is there a way to use the offloading strategies that spread weights across disk, RAM, and GPU?
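
To make the question concrete, here is a rough sketch of the CPU-resident pattern I have in mind. It is plain PyTorch, not an Accelerate API, and it assumes the whole table fits in CPU RAM. The sizes, the `head` layer, and the `train_step` function are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, kept small so the sketch runs anywhere.
NUM_ROWS, DIM, BATCH = 10_000, 16, 32

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The full table stays on the CPU with real (randomly initialized) weights.
# sparse=True means only the rows that were looked up receive gradients.
table = nn.Embedding(NUM_ROWS, DIM, sparse=True)

# Hypothetical small model that lives on the accelerator.
head = nn.Linear(DIM, 1).to(device)

table_opt = torch.optim.SparseAdam(list(table.parameters()), lr=1e-2)
head_opt = torch.optim.SGD(head.parameters(), lr=1e-2)


def train_step(ids, targets):
    rows = table(ids)              # lookup on the CPU, shape (BATCH, DIM)
    rows_dev = rows.to(device)     # only this batch is copied to the device
    pred = head(rows_dev).squeeze(-1)
    loss = nn.functional.mse_loss(pred, targets.to(device))

    table_opt.zero_grad()
    head_opt.zero_grad()
    loss.backward()                # gradient flows back to the CPU table
    table_opt.step()
    head_opt.step()
    return loss.item()


ids = torch.randint(0, NUM_ROWS, (BATCH,))
targets = torch.randn(BATCH)
loss_value = train_step(ids, targets)
```

This works for a table that fits in RAM, but I don't see how to extend it to a table that also has to spill to disk, which is where I was hoping Accelerate could help.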