I am trying to fine-tune T5-base on Colab with the TPU. I am using the official code for the fine-tuning (with my own dataset), but training on the TPU is extremely slow!
I am attaching the colab code with the various libraries I have installed: notebook.
Also, if I try to increase the batch size to 64 or more, I get a memory error, as there seems to be only about 8 GB available.
Can someone help me? Thank you!
Could you please post how you run this code? I mean, do you use xla_spawn.py, or are you running it inline in a notebook?
@finiteautomata I’m running it on Google Colab, online, with Google Colab Pro.
@Gennaro sorry, I didn’t see your notebook link. You are missing the xla_spawn.py part, that is, the script that makes your code run in parallel across the TPU cores. You should add this:
python xla_spawn.py --num_cores 8 t5.py \
@finiteautomata so I need this script, and then I run
python xla_spawn.py --num_cores 8 t5.py \
followed by all the other args of t5.py?
Exactly that. Try it and tell me what happens.
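As a side note on the batch-size memory error mentioned earlier, here is a rough sketch of how the batch size relates to the number of cores once the script is launched with --num_cores 8. It assumes t5.py takes a per-device batch size (as the Hugging Face Trainer does with per_device_train_batch_size), so the value applies to each core separately; the helper names and numbers are hypothetical.

```python
def global_batch_size(per_device_batch_size: int, num_cores: int) -> int:
    """Examples processed per optimizer step across all cores (data parallel)."""
    return per_device_batch_size * num_cores


def per_device_batch_size(target_global_batch: int, num_cores: int) -> int:
    """Per-core batch size needed to reach a target global batch size."""
    if target_global_batch % num_cores != 0:
        raise ValueError("target_global_batch must be divisible by num_cores")
    return target_global_batch // num_cores


num_cores = 8  # matches --num_cores 8 in the command above

# A per-device batch of 8 on 8 cores gives a global batch of 64.
print(global_batch_size(8, num_cores))

# To get a global batch of 64, each core only has to hold 8 examples.
print(per_device_batch_size(64, num_cores))
```

If that assumption holds, a global batch of 64 does not require fitting 64 examples into the memory of a single core.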
@finiteautomata I don’t know if it’s all working correctly. I get these messages:
1- Running tokenizer on train dataset: 0% 0/30 [00:00<?, ?ba/s]WARNING:t5:Process rank: -1, device: xla:0, n_gpu: 0distributed training: False, 16-bits training: False
Here it says xla:0 and not 1, but maybe that’s just for the tokenizer run.
2- huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks…
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
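The second option in that warning can be applied from Python at the top of the training script, before any tokenizer is used. A minimal sketch, assuming you would rather turn tokenizer parallelism off than restructure the script:

```python
import os

# Set this before the tokenizer is first used, so that the forked
# worker processes inherit the value and the warning is not emitted.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

This only silences the warning; it is not expected to change the training speed by itself.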
Now it’s fine-tuning, at step 1.
@finiteautomata I see that at the start it shows xla = 1, then xla = 0. Furthermore, the training seems to take a very long time, about 90 hours (on a GPU it takes less than 2); maybe it is somehow using the CPU?
@Gennaro I have the same problem when using the TPU in Colab (I have Google Colab Pro+). I was not using xla_spawn.py, so I gave it a try. Interestingly, the first time I ran my script with xla_spawn.py, training was faster. However, after terminating my node and reconnecting to the TPU, I cannot make it work again: even with xla_spawn.py the training is very slow (so it was kind of random and I can’t reproduce it).
Did you figure something out?
For folks who are still struggling, I think I found one potential reason why training on TPU is slow, look here (I set padding to True in my tokenizer and I’m already seeing a speedup in my TPU training; basically, it looks like it didn’t have anything to do with my torch xla installation).
Thank you @phosseini for your answer! Yes, I had the same issue with the randomness. I have read the discussion and it is interesting, so the fix would be padding = True (I was using False, if I’m not mistaken).
I also have the same question as you (whether you just need to set padding = False); I’ll wait for an answer in the other thread.
@phosseini I tried with pad_to_max_length=True and it’s working fine.
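One way to picture why fixed-length padding helps: XLA compiles a graph per input tensor shape, so batches whose padded length keeps changing can trigger repeated recompilation. The sketch below is plain Python with no tokenizer involved; pad_to_fixed_length, the toy token ids, and max_length=8 are hypothetical and only illustrate the shapes.

```python
from typing import List


def pad_to_fixed_length(batch: List[List[int]], max_length: int, pad_id: int = 0) -> List[List[int]]:
    """Truncate or pad every sequence so all have exactly max_length tokens."""
    padded = []
    for ids in batch:
        ids = ids[:max_length]  # truncate sequences that are too long
        padded.append(ids + [pad_id] * (max_length - len(ids)))
    return padded


# Two toy batches of token ids with different longest sequences.
batches = [
    [[5, 6, 7], [8, 9]],
    [[1, 2, 3, 4, 5, 6], [7]],
]

# Padding each batch only to its own longest sequence gives a new shape per batch.
dynamic_shapes = {(len(b), max(len(ids) for ids in b)) for b in batches}

# Padding every batch to the same max_length gives a single shape.
fixed_shapes = {(len(b), len(pad_to_fixed_length(b, max_length=8)[0])) for b in batches}

print(len(dynamic_shapes), "distinct shapes with per-batch padding")
print(len(fixed_shapes), "distinct shape with fixed-length padding")
```

With a Hugging Face tokenizer, the comparable call would be tokenizer(texts, padding="max_length", max_length=128, truncation=True), where 128 is a hypothetical value; whether that is exactly what pad_to_max_length=True does in your script is an assumption.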
Did you guys notice speedups vs. GPU training? @Gennaro I have paid access to A100 GPUs, but for side research tasks I’d like to use TPUs in case something works out…
@deathcrush The TPU is much faster than GPU training with an A100, P100, or V100.