I am trying to compare the inference time of different Hugging Face models at different batch sizes, but I'm not sure whether I should use per_device_train_batch_size, per_device_eval_batch_size, both, or some other method. What is the correct way to do this with the Trainer API?