Inconsistency in hyperparameter search results

I’m running a hyperparameter search on a BERT model using the Trainer with Ray Tune as the backend. This is how I’m doing it:

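import numpy as np
from ray import tune
from transformers import AutoModelForMultipleChoice, Trainer, TrainingArguments

# model_checkpoint, tokenizer, dataset, tuning_output_path and
# DataCollatorForMultipleChoice are defined elsewhere (not shown).

def compute_metrics(eval_predictions):
    # Predictions may arrive as a tuple; in that case use its first element.
    predictions = eval_predictions.predictions
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    label_ids = eval_predictions.label_ids
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}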

def get_model():
    return AutoModelForMultipleChoice.from_pretrained(model_checkpoint)


training_args = TrainingArguments(
    output_dir=tuning_output_path,  # output directory
    evaluation_strategy="epoch",
    save_strategy="no",
    report_to="wandb",
    logging_dir='./logs',  # directory for storing logs
    disable_tqdm=True
)

trainer = Trainer(
    model_init=get_model,  # called for each trial to build a fresh model
    args=training_args,  # training arguments, defined above
    train_dataset=dataset['train'],  # training dataset
    eval_dataset=dataset['dev'],  # evaluation dataset
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics
)

tune_config = {
    "per_device_train_batch_size": tune.grid_search([4, 8]),
    "num_train_epochs": tune.grid_search([4]),
    "learning_rate": tune.grid_search([2e-5, 3e-5])
}

best_trial = trainer.hyperparameter_search(
    hp_space=lambda _: tune_config,
    backend="ray",
    direction='maximize',
    n_trials=1,
    resources_per_trial={
        "cpu": 8,
        "gpu": 1
    },
    keep_checkpoints_num=0,
    local_dir="./ray_results/",
    log_to_file=True)

I’m doing hyperparameter tuning using grid search. Once the process starts, this is what I have in the output:

== Status ==
Current time: 2022-04-11 23:44:45 (running for 00:00:36.70)
Memory usage on this node: 8.5/51.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 8.0/8 CPUs, 1.0/1 GPUs, 0.0/26.3 GiB heap, 0.0/13.15 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /content/drive/MyDrive/colab/secret/ray_results/_objective_2022-04-11_23-44-08
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+------------------------+----------+----------------+-----------------+--------------------+-------------------------------+-------------+
| Trial name             | status   | loc            |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   objective |
|------------------------+----------+----------------+-----------------+--------------------+-------------------------------+-------------|
| _objective_4818a_00000 | RUNNING  | 172.28.0.2:735 |           2e-05 |                  4 |                             4 |         0.7 |
| _objective_4818a_00001 | PENDING  |                |           3e-05 |                  4 |                             4 |             |
| _objective_4818a_00002 | PENDING  |                |           2e-05 |                  4 |                             8 |             |
| _objective_4818a_00003 | PENDING  |                |           3e-05 |                  4 |                             8 |             |
+------------------------+----------+----------------+-----------------+--------------------+-------------------------------+-------------+

However, a couple of seconds later, I see the following in the output:

== Status ==
Current time: 2022-04-11 23:45:01 (running for 00:00:52.44)
Memory usage on this node: 8.5/51.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 8.0/8 CPUs, 1.0/1 GPUs, 0.0/26.3 GiB heap, 0.0/13.15 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /content/drive/MyDrive/colab/secret/ray_results/_objective_2022-04-11_23-44-08
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+------------------------+----------+----------------+-----------------+--------------------+-------------------------------+-------------+
| Trial name             | status   | loc            |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   objective |
|------------------------+----------+----------------+-----------------+--------------------+-------------------------------+-------------|
| _objective_4818a_00000 | RUNNING  | 172.28.0.2:735 |           2e-05 |                  4 |                             4 |         0.6 |
| _objective_4818a_00001 | PENDING  |                |           3e-05 |                  4 |                             4 |             |
| _objective_4818a_00002 | PENDING  |                |           2e-05 |                  4 |                             8 |             |
| _objective_4818a_00003 | PENDING  |                |           3e-05 |                  4 |                             8 |             |
+------------------------+----------+----------------+-----------------+--------------------+-------------------------------+-------------+

And at the very end of the tuning process, I have the following results:

== Status ==
Current time: 2022-04-11 23:48:43 (running for 00:04:34.51)
Memory usage on this node: 8.1/51.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/1 GPUs, 0.0/26.3 GiB heap, 0.0/13.15 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /content/drive/MyDrive/colab/secret/ray_results/_objective_2022-04-11_23-44-08
Number of trials: 4/4 (4 TERMINATED)
+------------------------+------------+----------------+-----------------+--------------------+-------------------------------+-------------+
| Trial name             | status     | loc            |   learning_rate |   num_train_epochs |   per_device_train_batch_size |   objective |
|------------------------+------------+----------------+-----------------+--------------------+-------------------------------+-------------|
| _objective_4818a_00000 | TERMINATED | 172.28.0.2:735 |           2e-05 |                  4 |                             4 |        0.64 |
| _objective_4818a_00001 | TERMINATED | 172.28.0.2:739 |           3e-05 |                  4 |                             4 |        0.78 |
| _objective_4818a_00002 | TERMINATED | 172.28.0.2:737 |           2e-05 |                  4 |                             8 |        0.64 |
| _objective_4818a_00003 | TERMINATED | 172.28.0.2:738 |           3e-05 |                  4 |                             8 |        0.68 |
+------------------------+------------+----------------+-----------------+--------------------+-------------------------------+-------------+

As can be seen, for the very same combination of batch size, epochs, and learning rate (4, 4, and 2e-5, respectively), I first saw an objective of 0.7, which is the accuracy on my dev set; later I saw 0.6, and at the very end 0.64. Am I missing something here? Why is this happening? Shouldn’t the objective stay the same for a given combination throughout the process?

I think this is what is happening: every trial runs for 4 epochs, and you evaluate at the end of every epoch. So when you see a status table in which the trial is still RUNNING, the “objective” shown is the result of the most recent evaluation, not a final score. After epochs 0, 1, 2, and 3 you will therefore see different values for “objective”, because the model keeps training and is evaluated again after every epoch; that latest score is what the table displays.
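To make that concrete, here is a toy sketch (not Ray's actual implementation) of a status table that keeps only the most recent report per trial. The `status` dict and the `report` helper are hypothetical names made up for illustration; the three accuracies are the ones shown for trial 00000 in the status output above, and the logs do not say which epoch produced each of them.

```python
# Toy model of a status table: each trial keeps only its most recent report.
status = {}  # trial name -> latest reported objective


def report(trial_name, objective):
    """Record a new evaluation result, overwriting the previous one."""
    status[trial_name] = objective


# Values seen for trial 00000 in the status output, in the order they appeared.
# Which epoch produced each value is not visible in the logs.
trial = "_objective_4818a_00000"
snapshots = []
for accuracy in (0.7, 0.6, 0.64):
    report(trial, accuracy)          # a new evaluation finishes
    snapshots.append(status[trial])  # what the table would show right now

print(snapshots)
```

Each snapshot is a different number for the same hyperparameters simply because it was taken at a different point in training.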

As you noticed, your score goes down. That is because, by default, Ray does not use the “best” evaluation checkpoint of each trial, only the last one. So it is likely that you reach your best result after the first epoch and then start to overfit. I had a similar issue: it is possible to ask Ray to report only the “best” checkpoint of each trial and then pick the best one across those trials, but you’ll have to change the Transformers code a bit. I have posted an issue with a solution here: Option to change Ray's gridsearch scope · Issue #16683 · huggingface/transformers · GitHub. Be sure to give it a thumbs up if you are also looking for such functionality.
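The difference between the two selection rules can be sketched in plain Python. This is only an illustration under stated assumptions: `trial_score` and `best_trial` are hypothetical helpers, and the per-epoch histories are invented to show the effect (they are not the numbers from the run above). The scope names "last" and "all" are borrowed from the `scope` argument that, as far as I know, Ray's `ExperimentAnalysis.get_best_trial` accepts.

```python
def trial_score(history, scope="last"):
    """Score one trial from its per-evaluation objectives (higher is better).

    scope="last": use only the final evaluation.
    scope="all":  use the best evaluation seen at any point in the trial.
    """
    if not history:
        raise ValueError("history must not be empty")
    if scope == "last":
        return history[-1]
    if scope == "all":
        return max(history)
    raise ValueError(f"unknown scope: {scope!r}")


def best_trial(histories, scope="last"):
    """Return the name of the trial with the highest score under `scope`."""
    return max(histories, key=lambda name: trial_score(histories[name], scope))


# Hypothetical per-epoch accuracies, made up for illustration only.
histories = {
    "trial_a": [0.80, 0.70, 0.64],  # peaks early, then overfits
    "trial_b": [0.60, 0.70, 0.78],  # keeps improving
}

print(best_trial(histories, scope="last"))  # trial_b
print(best_trial(histories, scope="all"))   # trial_a
```

With "last", a trial that peaked early and then overfit loses to one that ended higher; with "all", the early peak counts, so the winner can change.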

Thanks for the clarification, now it makes total sense. I agree that this is a much-needed feature and gave it a thumbs up. It would also be great if we could load the best checkpoint from hyperparameter tuning directly in Transformers. I’m not sure whether that is possible right now?