Can we parallelize transformers fine-tuning on a Hadoop cluster?

Hi, I’m relatively new to the transformers library, and although I’ve studied deep learning a bit, I haven’t done much real-world programming with PyTorch.

I would like to fine-tune some language models on my company’s data. We don’t have access to GPUs yet. One of my colleagues trained a small model, which took 10 days on a single machine.

We do have a large Hadoop/Spark cluster, and if we get good scalability we could bring the 10-day training down to 1 day, which would be a game changer! (Or we could train larger models.)

Is it possible to use our Spark/Hadoop infrastructure to parallelize transformers? If Spark/Hadoop is irrelevant here, we still have a number of physical machines. What do I need to do to parallelize training with the Hugging Face libraries? Install Ray? Does PyTorch’s own distributed support do this by itself?
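To make the question concrete, here is my rough (possibly wrong) understanding from the transformers docs: the Trainer seems to handle data-parallel training automatically when the same script is launched once per machine with torchrun. The model name, dataset, and hyperparameters below are just placeholders, and I’d welcome corrections:

```python
# train.py -- my rough understanding of multi-machine fine-tuning with the
# Hugging Face Trainer. Model, dataset, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Public placeholder dataset; in reality this would be our company data.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()

# Launched (I think) once per machine, e.g. across 4 CPU-only nodes:
#   torchrun --nnodes=4 --nproc_per_node=1 --node_rank=<0..3> \
#       --master_addr=head-node --master_port=29500 train.py
# where "head-node" is the hostname of whichever machine acts as rank 0.
```

Is that the intended pattern, or do I need something more on top of it?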

Unfortunately, most docs for distributed frameworks are oddly short on detail. They often suggest pip-installing a package but don’t clearly explain how to set up head vs. worker nodes, how to configure communication across machines, and so on.
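For example, after much digging, this is my current mental model of the raw torch.distributed setup those docs gloss over. I believe every machine runs the same script and only the environment variables differ per node; please correct me if I have it wrong:

```python
# My current (possibly wrong) mental model of the raw torch.distributed setup.
# Every machine runs this same script; the env vars differ per node.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# These are normally set by the launcher (torchrun) or by hand on each node:
#   MASTER_ADDR  - hostname/IP of the rank-0 ("head") machine
#   MASTER_PORT  - a free TCP port on that machine, e.g. 29500
#   WORLD_SIZE   - total number of processes across all machines
#   RANK         - this process's unique id in [0, WORLD_SIZE)
dist.init_process_group(backend="gloo", init_method="env://")  # gloo for CPU-only

model = torch.nn.Linear(10, 2)  # stand-in for a real model
ddp_model = DDP(model)          # gradients get all-reduced across nodes

# ... training loop: each process sees a different shard of the data,
# typically via torch.utils.data.distributed.DistributedSampler ...

dist.destroy_process_group()
```

If that picture is roughly right, my remaining questions are mostly operational: what actually starts these processes on each machine, and could YARN/Spark handle that scheduling for us?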

Hoping someone can point me in the right direction.