I’m running hyperparameter tuning on a BERT model using the Ray Tune backend with the Hugging Face Trainer. This is how I’m doing it:
import numpy as np
from transformers import AutoModelForMultipleChoice, Trainer, TrainingArguments

def compute_metrics(eval_predictions):
    predictions = (eval_predictions.predictions[0]
                   if isinstance(eval_predictions.predictions, tuple)
                   else eval_predictions.predictions)
    label_ids = eval_predictions.label_ids
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}

def get_model():
    return AutoModelForMultipleChoice.from_pretrained(model_checkpoint)

training_args = TrainingArguments(
    output_dir=tuning_output_path,   # output directory
    evaluation_strategy="epoch",     # evaluate on the dev set after every epoch
    save_strategy="no",
    report_to="wandb",
    logging_dir="./logs",            # directory for storing logs
    disable_tqdm=True,
)

trainer = Trainer(
    model_init=get_model,            # returns a fresh 🤗 Transformers model for each trial
    args=training_args,              # training arguments, defined above
    train_dataset=dataset["train"],  # training dataset
    eval_dataset=dataset["dev"],     # evaluation dataset
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics,
)
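To double-check what compute_metrics returns, here is a minimal, self-contained sanity check on toy data; the FakeEvalPrediction namedtuple is just a stand-in I made up to mimic the predictions/label_ids attributes the Trainer passes in:

```python
import collections
import numpy as np

# Stand-in for the EvalPrediction object the Trainer passes to compute_metrics.
FakeEvalPrediction = collections.namedtuple("FakeEvalPrediction",
                                            ["predictions", "label_ids"])

preds = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])  # 3 examples, 2 choices
labels = np.array([1, 0, 0])                            # third example will be wrong

ep = FakeEvalPrediction(predictions=preds, label_ids=labels)
pred_classes = np.argmax(ep.predictions, axis=1)        # -> [1, 0, 1]
accuracy = (pred_classes == ep.label_ids).astype(np.float32).mean().item()
print(accuracy)  # 2 of 3 correct -> ~0.667
```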
from ray import tune

tune_config = {
    "per_device_train_batch_size": tune.grid_search([4, 8]),
    "num_train_epochs": tune.grid_search([4]),
    "learning_rate": tune.grid_search([2e-5, 3e-5]),
}

best_trial = trainer.hyperparameter_search(
    hp_space=lambda _: tune_config,
    backend="ray",
    direction="maximize",
    n_trials=1,
    resources_per_trial={"cpu": 8, "gpu": 1},
    keep_checkpoints_num=0,
    local_dir="./ray_results/",
    log_to_file=True,
)
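For reference, the three grid_search axes above expand to the Cartesian product of their values, which matches the "Number of trials: 4/4" in the logs below. A plain-Python sketch of that expansion (not Ray's actual internals):

```python
import itertools

batch_sizes = [4, 8]
epochs = [4]
learning_rates = [2e-5, 3e-5]

# Grid search runs every combination of the listed values.
trials = list(itertools.product(batch_sizes, epochs, learning_rates))
print(len(trials))  # -> 4
for bs, ep, lr in trials:
    print(bs, ep, lr)
```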
I’m doing hyperparameter tuning with grid search. Once the process starts, this is the output I see:
== Status ==
Current time: 2022-04-11 23:44:45 (running for 00:00:36.70)
Memory usage on this node: 8.5/51.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 8.0/8 CPUs, 1.0/1 GPUs, 0.0/26.3 GiB heap, 0.0/13.15 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /content/drive/MyDrive/colab/secret/ray_results/_objective_2022-04-11_23-44-08
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+------------------------+----------+----------------+-----------------+--------------------+-------------------------------+-------------+
| Trial name | status | loc | learning_rate | num_train_epochs | per_device_train_batch_size | objective |
|------------------------+----------+----------------+-----------------+--------------------+-------------------------------+-------------|
| _objective_4818a_00000 | RUNNING | 172.28.0.2:735 | 2e-05 | 4 | 4 | 0.7 |
| _objective_4818a_00001 | PENDING | | 3e-05 | 4 | 4 | |
| _objective_4818a_00002 | PENDING | | 2e-05 | 4 | 8 | |
| _objective_4818a_00003 | PENDING | | 3e-05 | 4 | 8 | |
+------------------------+----------+----------------+-----------------+--------------------+-------------------------------+-------------+
However, a couple of seconds later, I see the following in the output:
== Status ==
Current time: 2022-04-11 23:45:01 (running for 00:00:52.44)
Memory usage on this node: 8.5/51.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 8.0/8 CPUs, 1.0/1 GPUs, 0.0/26.3 GiB heap, 0.0/13.15 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /content/drive/MyDrive/colab/secret/ray_results/_objective_2022-04-11_23-44-08
Number of trials: 4/4 (3 PENDING, 1 RUNNING)
+------------------------+----------+----------------+-----------------+--------------------+-------------------------------+-------------+
| Trial name | status | loc | learning_rate | num_train_epochs | per_device_train_batch_size | objective |
|------------------------+----------+----------------+-----------------+--------------------+-------------------------------+-------------|
| _objective_4818a_00000 | RUNNING | 172.28.0.2:735 | 2e-05 | 4 | 4 | 0.6 |
| _objective_4818a_00001 | PENDING | | 3e-05 | 4 | 4 | |
| _objective_4818a_00002 | PENDING | | 2e-05 | 4 | 8 | |
| _objective_4818a_00003 | PENDING | | 3e-05 | 4 | 8 | |
+------------------------+----------+----------------+-----------------+--------------------+-------------------------------+-------------+
And at the very end of the tuning process, I have the following results:
== Status ==
Current time: 2022-04-11 23:48:43 (running for 00:04:34.51)
Memory usage on this node: 8.1/51.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/8 CPUs, 0/1 GPUs, 0.0/26.3 GiB heap, 0.0/13.15 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /content/drive/MyDrive/colab/secret/ray_results/_objective_2022-04-11_23-44-08
Number of trials: 4/4 (4 TERMINATED)
+------------------------+------------+----------------+-----------------+--------------------+-------------------------------+-------------+
| Trial name | status | loc | learning_rate | num_train_epochs | per_device_train_batch_size | objective |
|------------------------+------------+----------------+-----------------+--------------------+-------------------------------+-------------|
| _objective_4818a_00000 | TERMINATED | 172.28.0.2:735 | 2e-05 | 4 | 4 | 0.64 |
| _objective_4818a_00001 | TERMINATED | 172.28.0.2:739 | 3e-05 | 4 | 4 | 0.78 |
| _objective_4818a_00002 | TERMINATED | 172.28.0.2:737 | 2e-05 | 4 | 8 | 0.64 |
| _objective_4818a_00003 | TERMINATED | 172.28.0.2:738 | 3e-05 | 4 | 8 | 0.68 |
+------------------------+------------+----------------+-----------------+--------------------+-------------------------------+-------------+
As can be seen, for the very same combination of batch size, number of epochs, and learning rate (4, 4, and 2e-5, respectively), I first saw an objective of 0.7, which is the accuracy on my dev set; later I saw 0.6; and at the very end I have 0.64. Am I missing something here? I want to understand why this is happening. Shouldn’t the objective stay the same for a given combination throughout the process?