Can we parallelize transformers fine-tuning on a Hadoop cluster?

Hi, I’m relatively new to the transformers library, and although I’ve studied deep learning a bit, I haven’t done much real-world programming with PyTorch.

I would like to fine-tune some language models on my company’s data. We don’t have access to GPUs yet. One of my colleagues trained a small model, which took 10 days on a single machine.

We do have a large Hadoop/Spark cluster, and if we get good scalability we could bring the 10-day training down to 1 day, which would be a game changer! (Or we could train larger models.)

Is it possible to use our Spark/Hadoop infrastructure to parallelize transformers? If Spark/Hadoop is irrelevant here, we still have a number of physical machines. What do I need to do to parallelize training with the Hugging Face libraries? Install Ray? Does PyTorch’s own distributed support do this by itself?
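To make the question concrete, here is my rough (possibly wrong) understanding from the transformers docs: the Trainer seems to handle data-parallel training automatically when the same script is launched once per machine with torchrun. The model name, dataset, and hyperparameters below are just placeholders, and I’d welcome corrections:

```python
# train.py -- my rough understanding of multi-machine fine-tuning with the
# Hugging Face Trainer. Model, dataset, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Public placeholder dataset; in reality this would be our company data.
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()

# Launched (I think) once per machine, e.g. across 4 CPU-only nodes:
#   torchrun --nnodes=4 --nproc_per_node=1 --node_rank=<0..3> \
#       --master_addr=head-node --master_port=29500 train.py
# where "head-node" is the hostname of whichever machine acts as rank 0.
```

Is that the intended pattern, or do I need something more on top of it?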

Unfortunately, most docs for distributed frameworks are oddly short on detail. They often suggest pip-installing a package but don’t clearly explain how to set up head vs. worker nodes, how to configure communication across machines, and so on.
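For example, after much digging, this is my current mental model of the raw torch.distributed setup those docs gloss over. I believe every machine runs the same script and only the environment variables differ per node; please correct me if I have it wrong:

```python
# My current (possibly wrong) mental model of the raw torch.distributed setup.
# Every machine runs this same script; the env vars differ per node.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# These are normally set by the launcher (torchrun) or by hand on each node:
#   MASTER_ADDR  - hostname/IP of the rank-0 ("head") machine
#   MASTER_PORT  - a free TCP port on that machine, e.g. 29500
#   WORLD_SIZE   - total number of processes across all machines
#   RANK         - this process's unique id in [0, WORLD_SIZE)
dist.init_process_group(backend="gloo", init_method="env://")  # gloo for CPU-only

model = torch.nn.Linear(10, 2)  # stand-in for a real model
ddp_model = DDP(model)          # gradients get all-reduced across nodes

# ... training loop: each process sees a different shard of the data,
# typically via torch.utils.data.distributed.DistributedSampler ...

dist.destroy_process_group()
```

If that picture is roughly right, my remaining questions are mostly operational: what actually starts these processes on each machine, and could YARN/Spark handle that scheduling for us?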

Hoping someone can point me in the right direction.