We’re training a model for binary classification. We’re trying to use
AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
and torchrun --nproc-per-node=2
But we can't figure out whether we're being bottlenecked by our PCIe 3.0 bus, which has only ~1 GB/s of bandwidth; the model itself is about 0.25 GB. Does anyone know:
(a) Is the AutoModel freezing all but the classification-head weights (in which case only a small number of gradients need to be shared between the GPUs), or is it computing gradients for every weight in the DistilBERT model (and thus requiring ~0.25 GB x2 to be transferred every batch)?
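For reference, here's our back-of-the-envelope estimate of the per-step gradient traffic in each case. The parameter counts are our assumptions (DistilBERT base is roughly 66M parameters, and the sequence-classification head is a 768x768 pre-classifier layer plus a 2-way classifier), not measured values:

```python
# Rough per-step gradient sizes for distilbert-base-uncased, assuming fp32
# gradients (4 bytes/param). Parameter counts below are approximations.
BYTES_PER_PARAM = 4

full_model_params = 66_000_000                    # DistilBERT base, roughly
head_params = (768 * 768 + 768) + (768 * 2 + 2)   # pre-classifier + 2-way classifier

full_grad_mb = full_model_params * BYTES_PER_PARAM / 1e6
head_grad_mb = head_params * BYTES_PER_PARAM / 1e6

print(f"all weights: ~{full_grad_mb:.0f} MB of gradients per step")   # ~264 MB
print(f"head only:   ~{head_grad_mb:.1f} MB of gradients per step")   # ~2.4 MB
```

So on a ~1 GB/s link, the two cases would differ by roughly two orders of magnitude in communication time, which is why we care about the answer.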
(b) If the model is effectively frozen and only the top layer is being adjusted, will torchrun be smart enough to know this, or will it attempt to send lots of zero gradients for the frozen transformer part as well as the gradients for the classification layer?
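In case it clarifies (b): this is the freezing pattern we'd apply before wrapping the model in DistributedDataParallel. The tiny nn.Sequential here is just a hypothetical stand-in for the DistilBERT backbone and head (DDP itself isn't shown). Our understanding is that DDP builds its gradient buckets from parameters with requires_grad=True at wrap time, so freezing before wrapping should avoid traffic for the frozen part, but we'd love confirmation:

```python
import torch
from torch import nn

# Hypothetical stand-in for the DistilBERT backbone + classification head,
# just to illustrate the requires_grad bookkeeping.
backbone = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
head = nn.Linear(16, 2)

# Freeze the backbone BEFORE wrapping in DistributedDataParallel, since
# (as we understand it) DDP decides which gradients to all-reduce based on
# requires_grad at construction time.
for p in backbone.parameters():
    p.requires_grad_(False)

model = nn.Sequential(backbone, head)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"trainable params: {trainable}, frozen params: {frozen}")
```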
Thanks in advance,
W