I’m trying to fine-tune the Facebook BART model, following this article, in order to classify text using my own dataset.
I’m using the Trainer object to train:
training_args = TrainingArguments(
    output_dir=model_directory,      # output directory
    num_train_epochs=1,              # total number of training epochs (article: 3)
    per_device_train_batch_size=4,   # batch size per device during training (article: 16)
    per_device_eval_batch_size=16,   # batch size for evaluation (article: 64)
    warmup_steps=50,                 # warmup steps for the learning rate scheduler (article: 500)
    weight_decay=0.01,               # strength of weight decay
    logging_dir=model_logs,          # directory for storing logs
    logging_steps=10,
)
model = BartForSequenceClassification.from_pretrained("facebook/bart-base")  # the article uses "facebook/bart-large-mnli"
trainer = Trainer(
    model=model,                          # the instantiated 🤗 Transformers model to be trained
    args=training_args,                   # training arguments, defined above
    compute_metrics=new_compute_metrics,  # a function to compute metrics during evaluation
    train_dataset=train_dataset,          # training dataset
    eval_dataset=val_dataset,             # evaluation dataset
)
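new_compute_metrics just turns the logits into predictions and computes standard classification metrics; a simplified version of it looks roughly like this (the exact metrics shouldn’t matter for the error):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def new_compute_metrics(eval_pred):
    # Simplified stand-in for my actual metric function
    logits, labels = eval_pred
    if isinstance(logits, tuple):  # BART can return extra tensors alongside the logits
        logits = logits[0]
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="weighted"),
    }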
This is the tokenizer I used:
from transformers import BartTokenizerFast
tokenizer = BartTokenizerFast.from_pretrained('facebook/bart-base')
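The train_dataset and val_dataset passed to the Trainer are plain torch datasets wrapping the tokenizer output, along these lines (simplified; train_texts, train_labels, etc. stand in for my own data):

import torch

class SimpleDataset(torch.utils.data.Dataset):
    """Wraps tokenizer output and labels so the Trainer can index into them."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# train_texts / train_labels etc. are placeholders for my raw texts and integer labels
train_dataset = SimpleDataset(tokenizer(train_texts, truncation=True, padding=True), train_labels)
val_dataset = SimpleDataset(tokenizer(val_texts, truncation=True, padding=True), val_labels)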
But when I call trainer.train(), it prints the following:
***** Running training *****
Num examples = 172
Num Epochs = 1
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 11
Followed by this error:
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/databricks/python/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 1496, in forward
outputs = self.model(
File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 1222, in forward
encoder_outputs = self.encoder(
File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 846, in forward
layer_outputs = encoder_layer(
File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 323, in forward
hidden_states, attn_weights, _ = self.self_attn(
File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/databricks/python/lib/python3.9/site-packages/transformers/models/bart/modeling_bart.py", line 191, in forward
query_states = self.q_proj(hidden_states) * self.scaling
File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/databricks/python/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
I’ve searched this site, GitHub, and Stack Overflow, but still haven’t found anything that fixes this for me. I’ve tried adding more memory, lowering the batch sizes and warmup steps, restarting the cluster, explicitly specifying CPU or GPU, and more, but none of it worked.
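By “specifying CPU or GPU” I mean, for example, restricting the run to a single device before anything initializes CUDA, roughly like this:

import os

# One of the variations I tried (simplified): hide all but the first GPU
# before torch/transformers touch CUDA, so the Trainer sees a single device
# and skips the DataParallel replica wrapping from the traceback above.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"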
I’m running this on Databricks, on a Standard_NC24s_v3 cluster (4 GPUs, 2 to 6 workers).
If you need any other information, comment and I’ll add it ASAP.