What is the best way to use Accelerate to train huge embedding matrices?
How do we effectively split such a matrix across multiple devices and initialize it with non-empty weights? I want to be able to access a batch of embeddings and move it to the GPU at each step. Would it be best to simply initialize the whole matrix on the CPU? Or is there a way to use the strategies that spread weights across disk, RAM, and GPU?
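
For concreteness, here is a minimal sketch of the access pattern I have in mind. It uses plain PyTorch rather than Accelerate, and the sizes, the placeholder loss, and the `train_step` helper are all hypothetical, picked only to illustrate the idea: the full table stays in CPU RAM, and only the rows for the current batch are moved to the GPU (falling back to the CPU if no GPU is present).

```python
import torch

# Hypothetical sizes, kept small so the sketch runs on any machine.
NUM_ROWS, DIM, BATCH = 10_000, 16, 32

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The full table lives in CPU RAM and starts with real (random normal) weights.
# sparse=True makes the gradient a sparse tensor covering only the looked-up rows.
table = torch.nn.Embedding(NUM_ROWS, DIM, sparse=True)
optimizer = torch.optim.SGD(table.parameters(), lr=0.1)


def train_step(ids: torch.Tensor) -> float:
    """Look up one batch on the CPU, compute on `device`, update only those rows."""
    optimizer.zero_grad()
    batch = table(ids).to(device)  # only BATCH x DIM values leave CPU memory
    loss = batch.pow(2).mean()     # placeholder loss, stands in for the real model
    loss.backward()                # gradient flows back to the CPU table
    optimizer.step()
    return loss.item()


ids = torch.randint(0, NUM_ROWS, (BATCH,))
first = train_step(ids)
second = train_step(ids)
```

This works for tables that fit in RAM, but I don't see how to extend it to the disk tier or how it is supposed to interact with Accelerate.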