Hi, I wanted to know when we can expect the Trainer API to support TPUs.
Can I implement it myself? Could you give me some tips on where to start?
Let me know,
Kind regards
The Trainer API does support TPUs. For example, the language modeling examples can be run on TPU. There’s one thing to take into account when training on TPUs:
Note: On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make sure all your batches have the same length.
You can take a look at the scripts for details.
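If it helps, here is roughly what those two flags amount to in code, assuming a fast tokenizer; the checkpoint name and `block_size` below are only placeholders:

```python
# Rough sketch of the preprocessing implied by the two flags; checkpoint and
# block_size are placeholders, not recommendations.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
block_size = 128

def tokenize_lines(examples):
    return tokenizer(
        examples["text"],         # --line_by_line: one line of text per example
        padding="max_length",     # --pad_to_max_length: every example gets the same length
        truncation=True,
        max_length=block_size,
    )

# dataset = dataset.map(tokenize_lines, batched=True, remove_columns=["text"])
```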
Hi Nielsr,
I tried running the WNUT17 Trainer example in torch on both Kaggle and Colab TPUs, and neither seems to be working (although I made sure the TPUs were correctly configured and XLA was correctly installed).
Here’s the colab notebook: Google Colab
Here’s the Kaggle notebook: https://www.kaggle.com/xhlulu/huggingface-wnut17-tpu-tests?scriptVersionId=83062978
As you can see, each iteration takes significantly more time than it would on a GPU (total training time is ~1.5 min on a P100).
You are not padding your inputs and targets to a fixed size in this example, but dynamically padding them to the longest input/target in each batch. This causes the TPU to recompile at each step, so it's expected that you see a much longer training time compared to GPUs.
To properly train on TPU, you need to apply fixed padding in `tokenize_and_align_labels` to a given length of your choice, and pad the labels to that same length.
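For example, here is a sketch of what that could look like, assuming a fast tokenizer and WNUT17-style columns (`tokens`, `ner_tags`); the `max_length` value is just an example:

```python
# Sketch of tokenize_and_align_labels with fixed-length padding.
# Assumes a fast tokenizer and WNUT17-style columns; max_length is illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_and_align_labels(examples, max_length=128):
    tokenized = tokenizer(
        examples["tokens"],
        is_split_into_words=True,
        padding="max_length",   # fixed shape instead of dynamic per-batch padding
        truncation=True,
        max_length=max_length,
    )
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)                  # special/padding tokens: ignored by the loss
            elif word_id != previous_word_id:
                label_ids.append(word_labels[word_id])  # label only the first sub-token of each word
            else:
                label_ids.append(-100)
            previous_word_id = word_id
        all_labels.append(label_ids)                    # already max_length long thanks to the padding
    tokenized["labels"] = all_labels
    return tokenized
```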
This was really helpful, thanks. Just one follow-up on this: if we're using a data collator that takes a tokenizer as a parameter (e.g., `DataCollatorForLanguageModeling`), and we set `padding=True` for the tokenizer before passing it to the data collator, is it going to have the same effect? (I did this and I'm already seeing a speed-up in TPU training, but I'm not sure whether it's really because of setting padding to true in the tokenizer.)
@sgugger Also, two quick questions, and I appreciate your input: could we just use `pad_to_multiple_of` in the data collator to make it compatible with the TPU, instead of potentially changing the tokenization or the data collating process?
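For concreteness, here is roughly the setup I mean; the checkpoint name and the value 128 are just placeholders, not recommendations:

```python
# Sketch of the two options discussed above; checkpoint and 128 are placeholders.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(examples):
    # Option 1: fixed padding at tokenization time, as suggested earlier in the thread
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Option 2: keep the tokenization as-is and have the collator round every batch up to a multiple
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability=0.15,
    pad_to_multiple_of=128,
)
```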