Multiple Perturbs on Same Trial but no more Checkpoints with PBT using RayTune

attalk · October 24, 2021, 7:16pm

Python version: 3.9.6
Transformers version: 4.11.3
Ray version: 1.7.0
Model Type: BART-Large
Task: Summarization
Using PyTorch Trainer

Hi @patrickvonplaten,
First off, thank you for helping create the fantastic Hugging Face library. I’m trying to fine-tune a bart-large model on a summarization task, using the population-based training (PBT) scheduler from Ray Tune for the hyperparameter search. In some runs (not all), a trial gets stuck: the scheduler keeps “perturbing” the same trial without creating any new checkpoints. Each perturbation is different (e.g., a new learning_rate is used). This repeats without end until the disk quota is exceeded and the job fails.
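To make clear what I mean by a “perturb”, here is a rough sketch of my understanding of one PBT exploit/explore step. This is not Ray Tune's implementation: `pbt_step`, the trial dictionaries, and the 0.8x/1.2x factors are hypothetical names and values chosen only for illustration.

```python
import random


def pbt_step(population, quantile=0.25, rng=None):
    """Run one illustrative exploit/explore step over a list of trial dicts.

    Each trial is a dict with "name", "score", "learning_rate", "checkpoint".
    Hypothetical helper for illustration; not Ray Tune's actual code.
    """
    rng = rng or random.Random(0)
    ranked = sorted(population, key=lambda t: t["score"], reverse=True)
    k = max(1, int(len(ranked) * quantile))
    top, bottom = ranked[:k], ranked[-k:]
    for trial in bottom:
        if trial in top:
            # Population too small to split into top and bottom groups.
            continue
        donor = rng.choice(top)
        # Exploit: restart the weak trial from the donor's checkpoint.
        trial["checkpoint"] = donor["checkpoint"]
        # Explore: perturb the donor's learning rate (assumed 0.8x or 1.2x).
        trial["learning_rate"] = donor["learning_rate"] * rng.choice([0.8, 1.2])
    return population
```

In my failing runs, the behaviour looks as if the explore half keeps firing for one trial (new learning_rate each time) while no new checkpoint is ever written for it.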

Here is the most relevant part of the code:


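# Imports used by this snippet (tokenizer, model_init, compute_metrics and the
# encoded datasets are defined elsewhere in my code)
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

SEED = 4

# Define Training Arguments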

training_args = Seq2SeqTrainingArguments(
    do_train=True,
    evaluation_strategy="epoch",
    eval_accumulation_steps=4,
    fp16=True,
    gradient_accumulation_steps=4,
    learning_rate=2.232e-5,
    load_best_model_at_end=True,
    logging_strategy="epoch",
    metric_for_best_model="eval_rouge1",
    num_train_epochs=5,
    output_dir="experimental_summarization_runs",
    overwrite_output_dir=True,
    per_device_eval_batch_size=1,
    per_device_train_batch_size=1,
    predict_with_generate=True,
    remove_unused_columns=True,
    report_to="wandb",
    save_strategy="epoch",
    save_total_limit=1,
    seed=SEED,
    warmup_ratio=0.06,
    weight_decay=0.01,
)


# Define Trainer for Hyperparameter Tuning only on portion of train and eval datasets
trainer = Seq2SeqTrainer(
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=encoded_train_dataset.shuffle(seed=SEED).select(range(700)),
    eval_dataset=encoded_eval_dataset.shuffle(seed=SEED).select(range(100)),
    model_init=model_init,
    compute_metrics=compute_metrics,
)

# Create Scheduler to use for Hyperparameter Search
scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="eval_rouge1",
    mode="max",
    perturbation_interval=1,
    hyperparam_mutations={
        "learning_rate": tune.loguniform(1e-6, 1e-2),
        "per_device_train_batch_size": [1],
        "per_device_eval_batch_size": [1],
        "gradient_accumulation_steps": [4],
        "eval_accumulation_steps": [4],
    },
)

tune_config = {
    "seed": tune.choice(list(range(1, 42))),
    "num_train_epochs": tune.grid_search([5]),
}


# Run Hyperparameter Search and save best trial
best_trial = trainer.hyperparameter_search(
    hp_space=lambda _: tune_config,
    direction="maximize",
    backend="ray",
    keep_checkpoints_num=1,
    checkpoint_score_attr="eval_rouge1",
    local_dir="hyperparam_results",
    raise_on_failed_trial=False,
    resources_per_trial={"cpu": 56, "gpu": 4},
    n_trials=5,
    scheduler=scheduler,
)

Here are some hopefully relevant snippets from the stdout file:

…

Along with some snippets from the .err file:

For reference, I was able to run the exact same code successfully with the BART-base model, so I suspect this may be related to memory requirements.
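Since the run ultimately dies on disk quota, one thing I can do to narrow it down is measure how much space each trial directory takes while the search is running. Below is a minimal sketch using only the standard library; `dir_sizes` is a hypothetical helper of mine, and I'm assuming the per-trial folders sit somewhere under the `hyperparam_results` directory passed as `local_dir`.

```python
import os


def dir_sizes(root):
    """Return {subdirectory name: total bytes} for each direct child of root."""
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks so linked files are not counted.
                if not os.path.islink(path):
                    total += os.path.getsize(path)
        sizes[entry.name] = total
    return sizes
```

Calling this periodically on the folder that holds the trial directories and sorting the result by size should show whether the stuck trial is the one that keeps growing.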
