When can we expect TPU Trainer?

Hi, I wanted to know when we can expect the Trainer API to support TPUs.

Can I implement it myself? Could you give me some tips on where to start?

Let me know,
Kind regards


The Trainer API does support TPUs. For example, the language modeling examples can be run on TPU. There’s one thing to take into account when training on TPUs:

Note: On TPU, you should use the flag --pad_to_max_length in conjunction with the --line_by_line flag to make sure all your batches have the same length.

You can take a look at the scripts for details.
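For reference, a TPU launch could look something like the sketch below, assuming the `run_mlm.py` example script and the `xla_spawn.py` launcher from the transformers repository (exact script names, flags, and dataset arguments vary across versions, so treat this as illustrative rather than copy-paste ready):

```shell
# Launch the masked language modeling example on 8 TPU cores.
# --line_by_line together with --pad_to_max_length keeps every batch the
# same shape, which avoids XLA recompiling the graph at each step.
python xla_spawn.py --num_cores 8 \
  run_mlm.py \
    --model_name_or_path bert-base-cased \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --line_by_line \
    --pad_to_max_length \
    --max_seq_length 128 \
    --do_train \
    --output_dir /tmp/tpu-mlm
```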


Hi Nielsr,

I tried running the WNUT17 Trainer example in torch on both Kaggle and Colab TPUs, and neither seems to work (even though I made sure the TPUs were correctly configured and XLA was correctly installed).

Here’s the colab notebook: Google Colab

Here’s the Kaggle notebook: https://www.kaggle.com/xhlulu/huggingface-wnut17-tpu-tests?scriptVersionId=83062978

As you can see, each iteration takes significantly more time than it would on a GPU (total training time is ~1.5 min on a P100).


You are not padding your inputs and targets to a fixed size in this example, but dynamically padding them to the longest input/target in each batch. This causes the TPU to recompile at each step, so it's normal to see a much longer training time compared to GPUs.

To properly train on TPU, you need to apply fixed padding in tokenize_and_align_labels to a given length of your choice, and pad the labels to that same length.
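To illustrate, here is a minimal sketch of that padding logic on pre-tokenized inputs. The pad id, max length, and input values are made up for the example; in the real `tokenize_and_align_labels` you would instead call the tokenizer with `padding="max_length"` and `max_length=...`, then pad the label list to the same length with `-100` (the index ignored by the loss):

```python
MAX_LENGTH = 128
PAD_TOKEN_ID = 0     # stands in for tokenizer.pad_token_id
IGNORE_INDEX = -100  # label positions ignored by the loss

def pad_to_fixed_length(input_ids, labels, max_length=MAX_LENGTH):
    """Truncate or pad both the input ids and the labels to one fixed length,
    so every batch fed to the TPU has exactly the same shape."""
    input_ids = input_ids[:max_length]
    labels = labels[:max_length]
    pad = max_length - len(input_ids)
    return (
        input_ids + [PAD_TOKEN_ID] * pad,
        labels + [IGNORE_INDEX] * pad,
    )

# Toy example with a fixed length of 6:
ids, labs = pad_to_fixed_length([101, 7592, 102], [0, 1, 0], max_length=6)
```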


This was really helpful, thanks. Just one follow-up on this: if we're using a data collator that takes a tokenizer as a parameter (e.g., DataCollatorForLanguageModeling), and we set padding=True on the tokenizer before passing it to the collator, does that have the same effect? (I did this and I'm already seeing a speed-up in TPU training, but I'm not sure it's really because of setting padding to true in the tokenizer.)

@sgugger Also, two quick questions, and I appreciate your input:

  • Is there any way to speed up training on TPU with dynamic padding as well?
  • Can we use the pad_to_multiple_of argument of the data collator to make it TPU-compatible, instead of changing the tokenization or the data collating process?
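For context on the second question, here is a rough reimplementation of the idea behind pad_to_multiple_of (the function names and pad id are made up for the sketch; in transformers you would simply pass `pad_to_multiple_of=...` to the data collator). Rounding each batch's dynamic length up to a multiple means the TPU sees only a handful of distinct shapes rather than one per batch, so it reduces, though it may not fully eliminate, XLA recompilation compared to fully dynamic padding:

```python
def padded_length(longest_in_batch, multiple):
    """Round the longest sequence length in a batch up to the
    nearest multiple, limiting the number of distinct batch shapes."""
    return ((longest_in_batch + multiple - 1) // multiple) * multiple

def pad_batch(batch, pad_id=0, multiple=64):
    """Pad every sequence in the batch to the rounded-up target length."""
    target = padded_length(max(len(seq) for seq in batch), multiple)
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]

# Toy example: sequences of length 3 and 2, rounded up to a multiple of 8.
batch = pad_batch([[1, 2, 3], [4, 5]], multiple=8)
```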