GPU memory overhead using ByT5

Hi there, I am using the ByT5 token classification model (small), and I keep getting a "CUDA out of memory" error after a while (the point at which it happens changes with the number of samples and, for reasons I don't understand, with whether wandb is used).
I am using 3 GPU devices.
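
For reference, the per-GPU memory can be checked from inside the training script with something like this (a minimal sketch using torch.cuda utilities, not my exact logging code):
import torch

# Sketch: dump memory stats for each visible GPU
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)    # bytes free / total on device i
    allocated = torch.cuda.memory_allocated(i)  # bytes held by live PyTorch tensors
    reserved = torch.cuda.memory_reserved(i)    # bytes reserved by the caching allocator
    print(f"GPU {i}: {allocated / 2**30:.2f} GiB allocated, "
          f"{reserved / 2**30:.2f} GiB reserved, "
          f"{free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")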

Model:
import torch
from torch import nn
from transformers import AutoModelForTokenClassification

# (also tried plain loading, without device_map)
# self.model = AutoModelForTokenClassification.from_pretrained("google/byt5-small", num_labels=2)
self.model = AutoModelForTokenClassification.from_pretrained(
    "google/byt5-small", num_labels=2, device_map="auto"
)

if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
self.model = nn.DataParallel(self.model, device_ids=[0, 1, 2])

# Move the model to CUDA
self.model = self.model.cuda()
self.trainer = None

# Apply weight initialization
self.initialize_weights()
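
To see how device_map="auto" actually spreads the model over the 3 GPUs, the placement chosen by accelerate can be printed right after loading (minimal sketch; hf_device_map only exists when device_map is passed):
# Sketch: inspect module placement from device_map="auto" (before the DataParallel wrap)
if hasattr(self.model, "hf_device_map"):
    for module_name, device in self.model.hf_device_map.items():
        print(f"{module_name} -> {device}")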

Trainer args:
training_args = TrainingArguments(
    output_dir="./results",                       # Directory to save the results
    eval_strategy="epoch",                        # Evaluate the model at the end of each epoch
    learning_rate=self.lr,                        # Learning rate
    per_device_train_batch_size=self.batch_size,  # Batch size for training
    per_device_eval_batch_size=self.batch_size,   # Batch size for evaluation
    # weight_decay=0.01,                          # Weight decay for regularization
    save_total_limit=1,                           # Keep only the latest checkpoint; older ones are deleted to save space
    num_train_epochs=self.num_epochs,             # Number of training epochs
    logging_dir="./src/model/logs",               # Directory to save the logs
    report_to="none",                             # Disable logging to external services like TensorBoard/W&B
    fp16=True,                                    # Enable mixed precision training
    gradient_accumulation_steps=8,                # Accumulate gradients for 8 steps before updating
    # eval_accumulation_steps=10,                 # Move eval predictions off the GPU every 10 steps
    logging_steps=10,                             # Log every 10 steps
    save_strategy="epoch",                        # Save the model at the end of each epoch
    # load_best_model_at_end=True,                # Load the best model at the end of training
    run_name="byt5-word-segmentation_" + str(start_time),  # Name of the run
    remove_unused_columns=False,
)
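
For context on the load per step, assuming the Trainer's DataParallel path puts per_device_train_batch_size samples on each of the 3 GPUs and multiplies by gradient accumulation, the samples consumed per optimizer update work out as below; n_gpus and grad_accum just mirror the settings above (sanity-check sketch):
# Sketch: samples per optimizer update under the settings above
n_gpus = 3
grad_accum = 8
samples_per_update = self.batch_size * n_gpus * grad_accum  # per-device batch x GPUs x accumulation steps
print(f"Samples per optimizer update: {samples_per_update}")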

Trainer:
self.trainer = Trainer(
    model=self.model,                  # The instantiated 🤗 Transformers model to be trained
    args=training_args,                # TrainingArguments
    train_dataset=self.train_dataset,  # Training dataset
    eval_dataset=self.eval_dataset,    # Evaluation dataset
    tokenizer=self.tokenizer,          # Tokenizer for the model
    data_collator=data_collator,       # Data collator
    compute_metrics=compute_metrics,   # Function that computes the metrics of interest
)
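
Training is then started in the usual way; the error below shows up somewhere during this call (the exact point varies, as mentioned above):
# Training entry point; the OutOfMemoryError below is raised partway through training
self.trainer.train()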

Error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 616.00 MiB. GPU 0 has a total capacity of 23.69 GiB of which 466.94 MiB is free. Including non-PyTorch memory, this process has 23.23 GiB memory in use. Of the allocated memory 21.82 GiB is allocated by PyTorch, and 160.61 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
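
The traceback itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation; as far as I understand, it has to be set before the first CUDA allocation, e.g. (sketch):
import os

# Must be set before anything is allocated on the GPU (top of the script, or in the launcher environment)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"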