Multiple Perturbs on Same Trial but No More Checkpoints with PBT Using Ray Tune

  • Python version: 3.9.6

  • Transformers version: 4.11.3

  • Ray version: 1.7.0

  • Model Type: BART-Large

  • Task: Summarization

  • Using PyTorch Trainer

Hi @patrickvonplaten,
First off, thank you for helping create the fantastic Hugging Face library. I’m trying to fine-tune a bart-large model on a summarization task, using the population-based training (PBT) scheduler from Ray Tune. In some cases (not every run), a trial gets stuck: the scheduler keeps “perturbing” the same trial without creating any new checkpoints. Each “perturb” is different (e.g., a new learning_rate is used). This repeated “perturbing” of the same trial continues without end, until the disk quota is exceeded and the job fails.

Here is the most relevant part of the code:


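from ray import tune
from ray.tune.schedulers import PopulationBasedTraining
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# tokenizer, model_init, compute_metrics, encoded_train_dataset, and
# encoded_eval_dataset are defined elsewhere (not shown).

SEED = 4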


# Define Training Arguments

training_args = Seq2SeqTrainingArguments(
    do_train=True,
    evaluation_strategy="epoch",
    eval_accumulation_steps=4,
    fp16=True,
    gradient_accumulation_steps=4,
    learning_rate=2.232e-5,
    load_best_model_at_end=True,
    logging_strategy='epoch',
    metric_for_best_model='eval_rouge1',
    num_train_epochs=5,
    output_dir='experimental_summarization_runs',
    overwrite_output_dir=True,
    per_device_eval_batch_size=1,
    per_device_train_batch_size=1,
    predict_with_generate=True,
    remove_unused_columns=True,
    report_to="wandb",
    save_strategy='epoch',
    save_total_limit=1,
    seed=SEED,
    warmup_ratio=0.06,
    weight_decay=0.01,
)


# Define Trainer for Hyperparameter Tuning only on portion of train and eval datasets
trainer = Seq2SeqTrainer(
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=encoded_train_dataset.shuffle(seed=SEED).select(range(700)),
    eval_dataset=encoded_eval_dataset.shuffle(seed=SEED).select(range(100)),
    model_init=model_init,
    compute_metrics=compute_metrics,
)

# Create Scheduler to use for Hyperparameter Search
scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    metric='eval_rouge1',
    mode="max",
    perturbation_interval=1,
    hyperparam_mutations={
        "learning_rate": tune.loguniform(1e-6, 1e-2),
        "per_device_train_batch_size": [1],
        "per_device_eval_batch_size": [1],
        "gradient_accumulation_steps": [4],
        "eval_accumulation_steps": [4],
    },
)

tune_config = {
    "seed": tune.choice(list(range(1, 42))),
    "num_train_epochs": tune.grid_search([5]),
}



# Run Hyperparameter Search and save best trial
best_trial = trainer.hyperparameter_search(
    hp_space=lambda _: tune_config,
    direction="maximize",
    backend="ray",
    keep_checkpoints_num=1,
    checkpoint_score_attr='eval_rouge1',
    local_dir='hyperparam_results',
    raise_on_failed_trial=False,
    resources_per_trial={"cpu": 56, "gpu": 4},
    n_trials=5,
    scheduler=scheduler,
)
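
For readers less familiar with PBT, the toy sketch below shows roughly what a single "perturb" does to a configuration like the one above. It is only loosely modeled on the commonly described behaviour (resample with some probability, otherwise scale a continuous value by 0.8 or 1.2, or step to a neighbouring list entry) and is not Ray Tune's actual implementation. The names toy_perturb and sample_loguniform, and the 0.25 resample probability, are assumptions made for illustration.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def toy_perturb(config, mutations, rng, resample_probability=0.25):
    """Return a perturbed copy of `config` (illustrative, not Ray Tune code)."""
    new_config = dict(config)
    for key, space in mutations.items():
        if isinstance(space, list):
            # Categorical space: resample, or step to a neighbouring entry.
            if rng.random() < resample_probability or config[key] not in space:
                new_config[key] = rng.choice(space)
            else:
                idx = space.index(config[key]) + rng.choice([-1, 1])
                new_config[key] = space[min(max(idx, 0), len(space) - 1)]
        elif rng.random() < resample_probability:
            # Continuous space: draw a fresh sample from the sampler.
            new_config[key] = space(rng)
        else:
            # Continuous space: scale the current value down or up.
            new_config[key] = config[key] * rng.choice([0.8, 1.2])
    return new_config


# Hypothetical stand-in for part of hyperparam_mutations above.
mutations = {
    "learning_rate": lambda rng: sample_loguniform(rng, 1e-6, 1e-2),
    "per_device_train_batch_size": [1],
    "gradient_accumulation_steps": [4],
}
config = {
    "learning_rate": 2.232e-5,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
}

rng = random.Random(4)
perturbed = toy_perturb(config, mutations, rng)
print(perturbed)
```

With single-element lists such as [1] and [4], only learning_rate can actually change in this sketch, which is consistent with each perturbation showing a new learning_rate.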

Here are some hopefully relevant snippets from the stdout file:





Along with some snippets from the .err file:



For reference, I was able to run the exact same code successfully with the BART-base model, so I’m wondering whether this has something to do with memory requirements.
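
Since the run ultimately fails on disk quota, one way to see which directories are growing is a small standard-library script like the sketch below. It assumes the results live under the local_dir used above (hyperparam_results); the helper names dir_size_bytes and largest_subdirs are illustrative and not part of Ray Tune or Transformers.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def largest_subdirs(base, top=5):
    """Return up to `top` (size_in_bytes, name) pairs for the biggest subdirectories."""
    sizes = []
    for entry in os.scandir(base):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((dir_size_bytes(entry.path), entry.name))
    return sorted(sizes, reverse=True)[:top]


# Only report when the results directory from the post actually exists.
if __name__ == "__main__" and os.path.isdir("hyperparam_results"):
    for size, name in largest_subdirs("hyperparam_results"):
        print(f"{size / 1e9:8.2f} GB  {name}")
```

Running this periodically while the search is active could show whether the growth comes from one trial's directory or from files outside the checkpoint folders.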