TPU slow finetuning T5-base

I am trying to fine-tune T5-base on Colab with a TPU. I am using the official code to fine-tune T5-base on my own dataset, but training on the TPU is extremely slow!

I am attaching the Colab code, with the various libraries I have installed: notebook.

Also, if I try to increase the batch size to 64 or more, I get an out-of-memory error, as there seems to be only about 8 GB available.

Can someone help me? Thank you!

Could you please post how you run this code? I mean, are you running it inline in a notebook?

@finiteautomata I’m running it on Google Colab, online, with Google Colab Pro.

@Gennaro sorry, I didn’t see your notebook link. You are missing the part that makes your code run in parallel. You should add this:

python --num_cores 8 \
    --model_name_or_path="t5-base" \
    --do_train \
    --do_eval \

etc etc
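For readers following along, here is a minimal sketch of how such a launch command is laid out. It assumes the usual pattern in which a launcher script receives --num_cores and then the training script with its own arguments. The file names launcher.py and train_script.py are hypothetical placeholders, since the command above does not show the script names, and build_launch_command is a helper made up for this illustration.

```python
import shlex


def build_launch_command(launcher, script, num_cores=8, script_args=None):
    """Assemble: python <launcher> --num_cores N <script> <script args...>."""
    cmd = ["python", launcher, "--num_cores", str(num_cores), script]
    for name, value in (script_args or {}).items():
        cmd.append(f"--{name}")
        # Boolean switches such as --do_train take no value.
        if value is not True:
            cmd.append(str(value))
    return cmd


cmd = build_launch_command(
    "launcher.py",       # hypothetical placeholder for the multi-core launcher
    "train_script.py",   # hypothetical placeholder for the fine-tuning script
    num_cores=8,
    script_args={"model_name_or_path": "t5-base", "do_train": True, "do_eval": True},
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) from a notebook cell, or the printed string can be pasted after a "!" in Colab.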

@finiteautomata so I need this code, and then I run

python --num_cores 8 \ followed by all the other args?

Exactly that. Try it and tell me what happens.

@finiteautomata I don’t know if everything is working correctly. I get these messages:

1- Running tokenizer on train dataset: 0% 0/30 [00:00<?, ?ba/s]WARNING:t5:Process rank: -1, device: xla:0, n_gpu: 0distributed training: False, 16-bits training: False
Here the device is xla:0 and not xla:1, but maybe that is only for the tokenizer run.

2- huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks…
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Now it’s fine-tuning, at step 1.
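As the tokenizers warning quoted above suggests, one way to silence it is to set the environment variable explicitly. A minimal sketch, assuming it is placed at the top of the notebook or script before the tokenizer is first used; choosing "false" here is a judgment call that trades tokenizer parallelism for a quiet, deadlock-free fork:

```python
import os

# Set this before the tokenizer is first used and before any worker
# processes are forked, so that every child process inherits the value.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

This only addresses the warning; it is not expected to change training speed on its own.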

@finiteautomata I see that at the start the device is xla:1, then it becomes xla:0. Furthermore, the training seems to take a very long time, about 90 hours (on a GPU it takes less than 2). Maybe it is somehow using the CPU?

@Gennaro I have the same problem when using the TPU in Colab (I have Google Colab Pro+). I was not using the launch command suggested above, so I gave it a try. Interestingly, the first time I ran my script with it, my training was faster. However, after terminating my node and reconnecting to the TPU, I cannot make it work again, and even with it the training is very slow. So it was kind of random and I can’t reproduce it.

Did you figure something out?

For folks who are still struggling: I think I found one potential reason why training on TPU is slow, look here. I set padding to True in my tokenizer and I’m already seeing a speedup in my TPU training, so it looks like the slowdown had nothing to do with my torch xla installation.

Thank you @phosseini for your answer! Yes, I had the same issue with the randomness. I have read the discussion and it is interesting, so it seems the fix is padding = True (I was using False, if I’m not mistaken).

I also have the same question as you (whether you just need to set padding = False), so I’ll wait for an answer in the other thread.

@phosseini I tried with pad_to_max_length=True and it’s working fine.
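A likely explanation for why padding matters here is that XLA compiles a graph for each distinct tensor shape it sees, so batches padded to varying lengths can trigger repeated compilation. The following is a rough pure-Python sketch of that effect, not a benchmark: it only counts how many distinct batch shapes each padding strategy produces. The helper names, batch size of 8, and max length of 128 are all assumptions made for the illustration.

```python
import random


def pad_batch(batch, pad_id=0, max_length=None):
    """Pad every sequence to max_length, or to the longest in the batch if None."""
    target = max_length if max_length is not None else max(len(s) for s in batch)
    padded = []
    for seq in batch:
        seq = seq[:target]  # truncate if longer than the target length
        padded.append(seq + [pad_id] * (target - len(seq)))
    return padded


def count_distinct_shapes(batches, max_length=None):
    """Count distinct (batch_size, seq_len) shapes after padding."""
    shapes = set()
    for batch in batches:
        padded = pad_batch(batch, max_length=max_length)
        shapes.add((len(padded), len(padded[0])))
    return len(shapes)


random.seed(0)
# 50 batches of 8 token-id sequences with random lengths between 5 and 128.
batches = [[[1] * random.randint(5, 128) for _ in range(8)] for _ in range(50)]

dynamic_shapes = count_distinct_shapes(batches)               # pad to longest in batch
fixed_shapes = count_distinct_shapes(batches, max_length=128)  # pad to a fixed length
print("dynamic padding shapes:", dynamic_shapes)
print("fixed padding shapes:", fixed_shapes)
```

With fixed-length padding every batch has the same shape, which is what pad_to_max_length=True aims for; with per-batch padding the shape changes from batch to batch.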

Did you guys notice speedups vs GPU training? @Gennaro I have paid access to A100 GPUs but for side research tasks I’d like to use TPUs in case something works out…

@deathcrush The TPU is much faster than GPU training with an A100, P100, or V100.
