Hello, I’ve been trying to train a BERT-based model on SQuAD with the Trainer API, and I would like to use the TPU available in Colab to speed up training.
I have followed the official notebook for training a SQuAD model, and then the suggestions in this older topic:
Tpu Trainer
which says the only change needed is setting a max_length parameter for the padding, which I already did.
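For context on why that max_length setting matters on TPU: XLA compiles a separate program for each distinct tensor shape it sees, so dynamically padded batches (a different sequence length every batch) trigger constant recompilation. A minimal pure-Python sketch of what fixed-length padding does (the function name and pad id below are my own, not the tokenizer’s):

```python
def pad_to_max_length(token_ids, max_length, pad_id=0):
    """Truncate or pad a sequence so every example has the same static shape.

    With dynamic padding each batch can have a different length and XLA
    would recompile for each new shape; padding everything to one fixed
    max_length avoids that.
    """
    ids = token_ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

print(pad_to_max_length([101, 2054, 102], 6))  # → [101, 2054, 102, 0, 0, 0]
```

In practice this is what `padding="max_length"` in the tokenizer call achieves for you.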
So what I did is basically:
- Install the xla library
- Set max_length for padding
- Set TPU environment in COLAB
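A quick way to check that step 3 actually took effect is to look for the TPU address in the environment. This is a hedged sketch: `COLAB_TPU_ADDR` is the variable the older Colab TPU (node) runtimes expose, and newer TPU-VM runtimes may advertise the accelerator differently.

```python
import os

def tpu_runtime_available() -> bool:
    # The Colab TPU (node) runtime advertises the TPU's gRPC address in
    # this environment variable; if it is missing, the notebook is not
    # attached to a TPU and torch_xla has nothing to connect to.
    return "COLAB_TPU_ADDR" in os.environ

print("TPU runtime attached:", tpu_runtime_available())
```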
But I am not getting any training speedup. Am I missing something? This is my code:
My code
Edit: I made some progress! I have to restart the runtime right after installing the XLA library, otherwise there are compatibility issues with older versions.
But I still didn’t get the speed-up I hoped for: with a GPU I get 0.8 iterations/second, while with the TPU I only get 3.2 iterations/second. Is that normal?
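For what it’s worth, the numbers quoted above do already represent a real gain; a quick check of the arithmetic:

```python
# Throughput comparison from the numbers above.
gpu_its = 0.8   # iterations/second on the GPU
tpu_its = 3.2   # iterations/second on the TPU (apparently a single core)
speedup = tpu_its / gpu_its
print(f"TPU speedup over GPU: {speedup:.1f}x")  # → 4.0x
```

So the TPU run is about 4x faster per iteration; the disappointment is presumably that only one of the cores is doing the work.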
Also, the training log reports the following:
Num examples = 66107
Num Epochs = 3
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Is that normal? To my understanding, a TPU has 8 cores, so with a batch size of 16 per device, shouldn’t the total train batch size be 128? How do I use all the cores? Thanks
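On the batch-size question: the log line “Total train batch size = 16” suggests only one core is participating, since the multi-core launch needs one process per core (for example via torch_xla’s `xmp.spawn`, or the `xla_spawn.py` launcher shipped in the transformers examples; running a Trainer cell directly in a notebook stays single-process). The arithmetic below just sketches the expected numbers, assuming an 8-core Colab TPU:

```python
# Expected effective batch size once all TPU cores participate.
per_device_batch = 16
tpu_cores = 8  # a Colab TPU v2/v3 board has 8 cores
total_batch = per_device_batch * tpu_cores
print(total_batch)  # → 128
```

With all cores active, the log should report a total train batch size of 128 and each epoch should take roughly an eighth as many steps.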