GPU memory overhead using ByT5

Hi there, I am using the ByT5 token classification model (small), and I keep getting a "CUDA out of memory" error after a while (the point at which it happens changes with the number of samples and, for reasons I don't understand, with whether wandb is used).
I am using 3 GPU devices.
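
For reference, the per-GPU memory can be checked from inside the training script with something like this (a minimal sketch using torch.cuda utilities, not my exact logging code):
import torch

# Sketch: dump memory stats for each visible GPU
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)    # bytes free / total on device i
    allocated = torch.cuda.memory_allocated(i)  # bytes held by live PyTorch tensors
    reserved = torch.cuda.memory_reserved(i)    # bytes reserved by the caching allocator
    print(f"GPU {i}: {allocated / 2**30:.2f} GiB allocated, "
          f"{reserved / 2**30:.2f} GiB reserved, "
          f"{free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")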

Model:
import torch
from torch import nn
from transformers import AutoModelForTokenClassification

# (also tried plain loading, without device_map)
# self.model = AutoModelForTokenClassification.from_pretrained("google/byt5-small", num_labels=2)
self.model = AutoModelForTokenClassification.from_pretrained(
    "google/byt5-small", num_labels=2, device_map="auto"
)

if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
self.model = nn.DataParallel(self.model, device_ids=[0, 1, 2])

# Move the model to CUDA
self.model = self.model.cuda()
self.trainer = None

# Apply weight initialization
self.initialize_weights()
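
To see how device_map="auto" actually spreads the model over the 3 GPUs, the placement chosen by accelerate can be printed right after loading (minimal sketch; hf_device_map only exists when device_map is passed):
# Sketch: inspect module placement from device_map="auto" (before the DataParallel wrap)
if hasattr(self.model, "hf_device_map"):
    for module_name, device in self.model.hf_device_map.items():
        print(f"{module_name} -> {device}")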

Trainer args:
training_args = TrainingArguments(
    output_dir="./results",                       # Directory to save the results
    eval_strategy="epoch",                        # Evaluate the model at the end of each epoch
    learning_rate=self.lr,                        # Learning rate
    per_device_train_batch_size=self.batch_size,  # Batch size for training
    per_device_eval_batch_size=self.batch_size,   # Batch size for evaluation
    # weight_decay=0.01,                          # Weight decay for regularization
    save_total_limit=1,                           # Keep only the latest checkpoint; older ones are deleted to save space
    num_train_epochs=self.num_epochs,             # Number of training epochs
    logging_dir="./src/model/logs",               # Directory to save the logs
    report_to="none",                             # Disable logging to external services like TensorBoard/W&B
    fp16=True,                                    # Enable mixed precision training
    gradient_accumulation_steps=8,                # Accumulate gradients for 8 steps before updating
    # eval_accumulation_steps=10,                 # Move eval predictions off the GPU every 10 steps
    logging_steps=10,                             # Log every 10 steps
    save_strategy="epoch",                        # Save the model at the end of each epoch
    # load_best_model_at_end=True,                # Load the best model at the end of training
    run_name="byt5-word-segmentation_" + str(start_time),  # Name of the run
    remove_unused_columns=False,
)
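
For context on the load per step, assuming the Trainer's DataParallel path puts per_device_train_batch_size samples on each of the 3 GPUs and multiplies by gradient accumulation, the samples consumed per optimizer update work out as below; n_gpus and grad_accum just mirror the settings above (sanity-check sketch):
# Sketch: samples per optimizer update under the settings above
n_gpus = 3
grad_accum = 8
samples_per_update = self.batch_size * n_gpus * grad_accum  # per-device batch x GPUs x accumulation steps
print(f"Samples per optimizer update: {samples_per_update}")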

Trainer:
self.trainer = Trainer(
    model=self.model,                  # The instantiated 🤗 Transformers model to be trained
    args=training_args,                # TrainingArguments
    train_dataset=self.train_dataset,  # Training dataset
    eval_dataset=self.eval_dataset,    # Evaluation dataset
    tokenizer=self.tokenizer,          # Tokenizer for the model
    data_collator=data_collator,       # Data collator
    compute_metrics=compute_metrics,   # Function that computes the metrics of interest
)
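
Training is then started in the usual way; the error below shows up somewhere during this call (the exact point varies, as mentioned above):
# Training entry point; the OutOfMemoryError below is raised partway through training
self.trainer.train()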

Error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 616.00 MiB. GPU 0 has a total capacity of 23.69 GiB of which 466.94 MiB is free. Including non-PyTorch memory, this process has 23.23 GiB memory in use. Of the allocated memory 21.82 GiB is allocated by PyTorch, and 160.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
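
The traceback itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation; as far as I understand, it has to be set before the first CUDA allocation, e.g. (sketch):
import os

# Must be set before anything is allocated on the GPU (top of the script, or in the launcher environment)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"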