Unusual pattern of CUDA out of memory errors when using hyperparameter search (Optuna backend)

I have been doing some experiments with hyperparameter search, and I thought I would write a simple function to find the maximum supported batch size, so I can constrain the hp_space in terms of train batch size. Here is what the function looks like:

from transformers import Trainer, TrainingArguments

def auto_select_batchsize(model, tokenizer, data, max_batch_size):
    batch_size = max_batch_size
    while batch_size > 0:
        try:
            print(f"using batch size: {batch_size}")
            training_args = TrainingArguments(
                output_dir="/tmp/results",
                per_device_train_batch_size=batch_size,
                num_train_epochs=1,
                logging_dir="/tmp/logs",
            )
            trainer = Trainer(
                model=model,
                args=training_args,
                tokenizer=tokenizer,
                train_dataset=data,
            )
            trainer.train()
            # training ran without OOM, so this batch size fits on the GPU
            return batch_size
        except Exception as e:
            if "CUDA out of memory" in str(e):
                print(f"Lowering batch size from: {batch_size} -> {batch_size // 2}")
                batch_size //= 2
            else:
                raise
    print("Problem with batchsize")
    return None

Then I call the function like this:

auto_select_batchsize(model, tokenizer, tokenized_data['train'], max_batch_size=256)

and this is the output it gives:

using batch size: 256
Lowering batch size from: 256 -> 128
using batch size: 128
Lowering batch size from: 128 -> 64
using batch size: 64
Lowering batch size from: 64 -> 32
using batch size: 32
 [ 5/782 00:04 < 20:20, 0.64 it/s, Epoch 0.01/1]
Step	Training Loss
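A side note on the probe itself: since each attempt only needs to survive a handful of steps, not the full 782 shown above, I'm thinking of capping it with max_steps. Here is a minimal sketch of how I'd build the probe's TrainingArguments; probe_args is just a hypothetical helper name, and as far as I know a positive max_steps overrides num_train_epochs:

from transformers import TrainingArguments

def probe_args(batch_size: int) -> TrainingArguments:
    # same arguments as in auto_select_batchsize, but capped at a few steps so each
    # probe attempt stops early instead of running a full epoch
    return TrainingArguments(
        output_dir="/tmp/results",
        per_device_train_batch_size=batch_size,
        max_steps=5,  # a positive max_steps overrides num_train_epochs
        logging_dir="/tmp/logs",
    )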

So I concluded that batch size 32 is the maximum here. But then I saw some weird behavior: on one run, even batch size 32 gave a CUDA out of memory error, and the function walked the batch size all the way down (32 → 16 → … → 0), hitting an OOM exception at every size, until it eventually gave up. Yet when I ran the exact same call again right afterwards, it started to work.

Here is my hyperparameter search pipeline:

import optuna

def objective(trial):
    # Define the training arguments
    training_args = TrainingArguments(
        output_dir="./results",
        per_device_train_batch_size=trial.suggest_categorical("per_device_train_batch_size", [2, 4, 8, 16, 32]),
        num_train_epochs=trial.suggest_int("num_train_epochs", 1, 2),
        evaluation_strategy="steps",
        eval_steps=1,
        logging_dir="./logs",
    )

    # Create a Trainer with the defined arguments
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_data["train"],
        eval_dataset=tokenized_data["test"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,  # assumed defined elsewhere; without it, evaluate() won't return "eval_accuracy"
    )

    # Train the model
    trainer.train()

    # Evaluate the model
    eval_results = trainer.evaluate()

    # Return the evaluation metric (e.g., accuracy) as the objective
    return eval_results["eval_accuracy"]

# Create an Optuna study and optimize hyperparameters
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=2)

# Get the best hyperparameters
best_params = study.best_params
print("Best Hyperparameters:", best_params)

Here too, it initially gives me a CUDA out of memory error, and this is where the weird behavior continues: I lowered the batch size to 4 and it ran fine. Then I interrupted the process, increased the batch size to 16, and now it suddenly runs without any problem.
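For what it's worth, this is roughly what I now run between attempts to see whether a previous (possibly interrupted) run is still holding on to GPU memory. It's only a minimal sketch using plain gc / torch.cuda calls (nothing Trainer-specific), just to rule out leftover allocations:

import gc
import torch

def show_and_free_gpu_memory():
    # memory currently held by live tensors vs. memory reserved by PyTorch's caching allocator
    print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    gc.collect()               # drop unreachable Python objects (old Trainer / model references)
    torch.cuda.empty_cache()   # hand cached blocks back to the driver
    print(f"reserved after empty_cache: {torch.cuda.memory_reserved() / 1e9:.2f} GB")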

Can anyone explain this behaviour?