We’re training a model for binary classification. We’re trying to use
AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
and torchrun --nproc-per-node=2
But we can't figure out whether we're being bottlenecked by our PCIe 3.0 bus, which has only ~1 GB/s of bandwidth; the model itself is about 0.25 GB. Does anyone know:
(a) Is the AutoModel freezing all but the classification-head weights (in which case only a small number of gradients need to be shared between the GPUs), or is it computing gradients for every weight in the DistilBERT model (and thus requiring ~0.25 GB x2 to be transferred every batch)?
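For reference, here's our back-of-the-envelope estimate of the per-step gradient traffic in each case. The parameter counts are our assumptions (DistilBERT base is roughly 66M parameters, and the sequence-classification head is a 768x768 pre-classifier layer plus a 2-way classifier), not measured values:

```python
# Rough per-step gradient sizes for distilbert-base-uncased, assuming fp32
# gradients (4 bytes/param). Parameter counts below are approximations.
BYTES_PER_PARAM = 4

full_model_params = 66_000_000                    # DistilBERT base, roughly
head_params = (768 * 768 + 768) + (768 * 2 + 2)   # pre-classifier + 2-way classifier

full_grad_mb = full_model_params * BYTES_PER_PARAM / 1e6
head_grad_mb = head_params * BYTES_PER_PARAM / 1e6

print(f"all weights: ~{full_grad_mb:.0f} MB of gradients per step")   # ~264 MB
print(f"head only:   ~{head_grad_mb:.1f} MB of gradients per step")   # ~2.4 MB
```

So on a ~1 GB/s link, the two cases would differ by roughly two orders of magnitude in communication time, which is why we care about the answer.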
(b) If the model is effectively frozen and only the top layer is being adjusted, will torchrun be smart enough to know this, or will it attempt to send lots of zero gradients for the frozen transformer part as well as the gradients for the classification layer?
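In case it clarifies (b): this is the freezing pattern we'd apply before wrapping the model in DistributedDataParallel. The tiny nn.Sequential here is just a hypothetical stand-in for the DistilBERT backbone and head (DDP itself isn't shown). Our understanding is that DDP builds its gradient buckets from parameters with requires_grad=True at wrap time, so freezing before wrapping should avoid traffic for the frozen part, but we'd love confirmation:

```python
import torch
from torch import nn

# Hypothetical stand-in for the DistilBERT backbone + classification head,
# just to illustrate the requires_grad bookkeeping.
backbone = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
head = nn.Linear(16, 2)

# Freeze the backbone BEFORE wrapping in DistributedDataParallel, since
# (as we understand it) DDP decides which gradients to all-reduce based on
# requires_grad at construction time.
for p in backbone.parameters():
    p.requires_grad_(False)

model = nn.Sequential(backbone, head)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"trainable params: {trainable}, frozen params: {frozen}")
```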
Thanks in advance,
W