Task prefixes matter when:
1. you are doing multi-task training
2. your task is similar or related to one of the supervised tasks used in T5's pre-training mixture.
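For example, a minimal sketch of reusing the prefix of the closest pre-training task (t5-small and the "summarize:" prefix are just for illustration):

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

article = "..."  # your source text
# reuse the prefix of the closest pre-training task (summarization here);
# for a completely unrelated task the prefix can be dropped or replaced
input_ids = tokenizer("summarize: " + article, return_tensors="pt", truncation=True).input_ids
summary_ids = model.generate(input_ids, max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```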
T5 needs a slightly higher LR than the default one set in Trainer; in my experiments 1e-4 and 3e-4 worked for almost all problems (classification, QA, question generation, summarization)
no need to pass decoder_input_ids to T5 yourself, just pass labels and the model will prepare decoder_input_ids for you. labels should end with the eos_token. (Important! This is where most mistakes happen.)
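A minimal sketch of that (the translation pair and t5-small are just placeholders); the tokenizer appends the eos token to the labels for you, and the model builds decoder_input_ids from labels internally:

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")
# the tokenizer already appends </s> (eos) to the label sequence
labels = tokenizer("Das Haus ist wunderbar.", return_tensors="pt").input_ids

# no decoder_input_ids: the model shifts `labels` right and prepends the
# pad token (its decoder_start_token_id) internally
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
```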
T5 uses the pad_token as the decoder_start_token_id, so when doing generation without the generate function make sure you start the decoder sequence with the pad token.
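For illustration, a bare-bones greedy decoding loop (no caching or beam search; reusing model, tokenizer, and inputs from the snippet above) that starts the decoder with the pad token:

```python
import torch

decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])  # == pad_token_id for T5
encoder_outputs = model.get_encoder()(inputs.input_ids, attention_mask=inputs.attention_mask)

for _ in range(32):
    logits = model(encoder_outputs=encoder_outputs,
                   attention_mask=inputs.attention_mask,
                   decoder_input_ids=decoder_input_ids).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
    if next_token.item() == model.config.eos_token_id:
        break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```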
trimming batches to the longest example when training on TPU leads to much slower training, because XLA recompiles the graph for every new input shape; pad to a fixed length instead (see the snippet below).
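In other words, pad every batch to a fixed shape, e.g. (reusing tokenizer from above; texts and the lengths are placeholders):

```python
batch = tokenizer(
    texts,                  # list of raw strings
    padding="max_length",   # fixed shape -> no per-batch XLA recompilation
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
```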
apparently, because of the SentencePiece vocabulary and some possible leakage of other languages in the C4 data, T5 gives somewhat sensible results for French. I fine-tuned it on FQuAD (the French version of SQuAD) for question generation, and BLEU-4 against the dev set was 15.
Not sure if it's an issue or not, but in some cases using label_smoothing with T5 resulted in NaN loss
Training with AdaFactor works quite well for me so far. I use the constant LR of 0.001 recommended in Colin Raffel's finetuning papers and the other AdaFactor settings from the original Noam Shazeer paper.
Fairseq's AdaFactor implementation is good – except you need to turn the auto-scaling options off – no idea why they are on by default in the init.
This even “works” in FP16 – but don’t get me started on Native AMP quite yet…
But in summary – I would strongly recommend using AdaFactor and not Adam for T5 training and finetuning (a minimal setup sketch follows below):
this is what the T5 authors use themselves
AdaFactor was developed specifically with Transformers/T5 in mind (they say so in the paper)
Adam is a massive waste of memory in general; it's not surprising that something more efficient would work just as well, unless you have custom additions to your model
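To make that concrete, a minimal setup sketch using the Adafactor port that ships with transformers (same arguments as the fairseq one, so the same gotcha applies; model, training_args, and train_ds are assumed to exist already):

```python
from transformers import Adafactor, get_constant_schedule

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,                 # constant LR from the T5 finetuning recipe
    scale_parameter=False,   # turn off the auto-scaling heuristics...
    relative_step=False,     # ...so the fixed lr above is actually used
    warmup_init=False,
)
lr_scheduler = get_constant_schedule(optimizer)

# when using Trainer, pass both in so it does not silently build AdamW instead:
# trainer = Trainer(model=model, args=training_args, train_dataset=train_ds,
#                   optimizers=(optimizer, lr_scheduler))
```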
Hi guys!
I just finished training T5-large on ELI5 on 270,000 examples using a TPU V2-8 on Colab, modified from @valhalla's notebook! These are not really finetuning tips, but some tips to make T5-large trainable on a TPU V2-8.
T5-large is challenging to train on TPU V2-8 with PyTorch (for me)
I faced a lot of memory problems (even on a Colab High-RAM instance); this notebook by Davide Libenzi, one of the XLA authors, suggests declaring the large model outside _mp_fn (see his mx variable; a sketch of this pattern follows below)
with T5-base there is around 7 minutes of overhead before training can start; for T5-large this overhead takes 1 hour for me
with max_length = 128 (both input and target), I am able to set per_device_train_batch_size = 4 (so, global_batch_size = 4*8 = 32)
there is an issue where xm.save() causes a memory error with large models like XLM-Roberta; it happens with T5-large too, so I have to effectively disable the Trainer's default checkpointing by setting save_steps to 1000000
Combining all of these took me around 1 day before I could make a trainable notebook, so hopefully these tricks can be useful to some of you guys too!
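A rough sketch of the "declare the model outside _mp_fn" pattern, here via torch_xla's MpModelWrapper helper (treat the exact wiring as an assumption about your setup):

```python
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp
from transformers import T5ForConditionalGeneration

# build the big model once, outside _mp_fn, so the forked TPU processes share
# the host copy instead of each constructing their own
WRAPPED_MODEL = xmp.MpModelWrapper(T5ForConditionalGeneration.from_pretrained("t5-large"))

def _mp_fn(index):
    device = xm.xla_device()
    model = WRAPPED_MODEL.to(device)  # moves the shared copy onto this TPU core
    # ... build the dataloader/optimizer and run the training loop here ...

if __name__ == "__main__":
    # 'fork' is what lets the pre-built model be shared across the 8 processes
    xmp.spawn(_mp_fn, nprocs=8, start_method="fork")
```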
I would like to find time to make a TF2 version which should be more stable on TPU
More notes
As @valhalla mentioned in his notebook, a High-RAM instance is a must. Lately Kaggle notebooks increased RAM to 16GB for TPU V3-8, but I could not get the training to succeed (sadly, since V3-8 should be 2x faster than V2-8)
Hi @moscow25, thanks for sharing the AdaFactor info. I was wondering if the constant LR of 1e-3 also works for small batch sizes, because in the paper they mention that the BS for fine-tuning was 128, and it's not possible to use a BS of 128 on a single V100 for models larger than t5-base.
This trick of loading the model outside of _mp_fn is awesome! It should save some memory. In pytorch-xla the model and the dataset are loaded in all processes (8 in the case of 8 TPU cores), so it ends up taking a lot of memory. Lazy loading the dataset should also reduce RAM usage (a sketch is below).
On V3-8, I was able to use bs of 8 per device with max_source_length 512 and max_target_length 64
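On the lazy-loading point above, a hypothetical sketch of a dataset that keeps only raw strings in memory and tokenizes inside __getitem__ (the class name, fields, and lengths are illustrative):

```python
from torch.utils.data import Dataset

class LazySeq2SeqDataset(Dataset):
    """Tokenize lazily, per example, instead of pre-tokenizing the whole corpus."""

    def __init__(self, sources, targets, tokenizer, max_source_len=512, max_target_len=64):
        self.sources, self.targets = sources, targets
        self.tokenizer = tokenizer
        self.max_source_len, self.max_target_len = max_source_len, max_target_len

    def __len__(self):
        return len(self.sources)

    def __getitem__(self, idx):
        model_inputs = self.tokenizer(
            self.sources[idx], max_length=self.max_source_len,
            truncation=True, padding="max_length", return_tensors="pt",
        )
        labels = self.tokenizer(
            self.targets[idx], max_length=self.max_target_len,
            truncation=True, padding="max_length", return_tensors="pt",
        ).input_ids
        return {
            "input_ids": model_inputs.input_ids.squeeze(0),
            "attention_mask": model_inputs.attention_mask.squeeze(0),
            "labels": labels.squeeze(0),
        }
```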
Sure thing @valhalla. I did not try too many settings… but LR 0.001 seems to work just fine for smaller finetuning batches. I'm running a global batch of 2*8 [2 per GPU] – though with a bit of gradient accumulation (4x I believe), but tbh it's not really that sensitive as far as I can tell. The only gotcha is to turn off those extra scaling parameters that fairseq threw in there and set to True by default for no good reason (scale_parameter=False, relative_step=False).
To get bigger batches, I’m pretty sure we need to add some gradient checkpointing to the model. Trying that out next…
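For anyone reading this later: recent transformers versions expose gradient checkpointing on the model directly (a present-day note, not what was available at the time), e.g.:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-large")
model.gradient_checkpointing_enable()  # trade extra compute for activation memory

# or via the Trainer:
# training_args = TrainingArguments(..., gradient_checkpointing=True)
```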
Some T5 questions that multiple people have asked and that I think I know the answers to. Correct me if I'm wrong! Quotes are from the paper.
Q: What masking objective did they use for pretraining?
A: Span corruption.
Specifically, we use a mean span length of 3 and corrupt 15% of the original sequence. We found that this objective produced marginally better performance (Table 7) while being slightly more computationally efficient due to shorter target sequence lengths.
Q: Are the hf checkpoints trained with multi-tasking?
A: yes
More on T5 pre-training objective
Each corrupted span is replaced by a unique sentinel token. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input, plus a final sentinel token.
T5 uses 100 extra ids as sentinel tokens (<extra_id_0> ... <extra_id_99>)
From the HF docs:
output sequence is formed as a concatenation of the same sentinel tokens and the real masked tokens
E.g. the sentence “The cute dog walks in the park” with the masks put on “cute dog” and “the” should be processed as follows:
input_text = "The <extra_id_0> walks in <extra_id_1> park"
target_text = "<extra_id_0> cute dog <extra_id_1> the <extra_id_2> </s>"
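The same example end-to-end, following the HF docs pattern (t5-small is just for illustration):

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park",
                      return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>",
                   return_tensors="pt").input_ids

# the sentinel tokens are already in the T5 vocabulary, so this just works
loss = model(input_ids=input_ids, labels=labels).loss
```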
As for FP16 – I did manage to get T5-Large checkpoints training in FP16… actually T5-Large won't even run inference in FP16 (you get NaNs) without keeping the feed-forward layers out of FP16 or doing some custom scaling magic…
Even if you only do FP16 on the attention operations and nothing else, I am able to fine-tune a checkpoint without NaNs, but the model diverges after a while. This happens even with just a small part of the model in FP16, and with a lower LR.
It’s possible that this can eventually work, but it’s not simple.
Moreover – if you are doing your attention operations in FP16 but keeping all weights and gradients in FP32 (as well as FP16) – this may save a little bit of compute but does not save GPU memory during training. So this kind of very conservative FP16 is not useful, sad to say:
Diverges the model
Does not use less GPU memory (so you can't use larger batches or a bigger model)
What do you mean by “does work”? You clearly say that it made your results worse.
From what I saw with T5-small, “scale_parameter” makes the result slightly worse as well.
I'm currently trying T5-3B and the model doesn't learn (beta1=0.9, scale_parameter=False, and the default learning rate), so I wonder what can be done to mitigate it.
There seems to be discussion (various threads and git issues) about whether the T5 architecture is just inherently unstable in FP16 – that the frequent FP16 NaNs aren't a bug in the transformers implementation or in the user's training arguments, but may be unavoidable in true FP16 mode.
I would like to bring into this discussion that when I run the Mesh TensorFlow version of T5 from the research repo (https://github.com/google-research/text-to-text-transfer-transformer) on TPU on my data set, it is rock solid in 16-bit training (I assume because of the wider dynamic range of bf16). On the same data set I essentially can never get FP16 working on anything larger than t5-small with HuggingFace (with Adafactor, with and without LR warmup, native AMP / Apex O1/O2/O3, etc.)
For workflow reasons, using the research Mesh code is not going to be an option, and I need to get the 3B model training on GPUs, which will require ~16-bit compute in order to fit in ~32-48GB of GPU memory.
Ampere GPUs support a similar-sized format, TF32 (19 bits vs 16 bits for BF16 on TPU). Has anyone tried (or even has access to) an A100 GPU to see if TF32 solves the issue here?
EDIT: it looks like Ampere also natively supports BF16. So that looks like a good way to compare T5 Mesh on TPU with T5 HF on Ampere, both using BF16.
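For anyone who wants to try this, a sketch of the two Ampere options in plain PyTorch (assumes a recent PyTorch; model and the input tensors are placeholders):

```python
import torch

# TF32: keep FP32 semantics but let matmuls/convs run on TF32 tensor cores
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# BF16 autocast: same exponent range as FP32, so no loss scaling is needed
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(input_ids=input_ids,
                 attention_mask=attention_mask,
                 labels=labels).loss
loss.backward()
# (newer transformers Trainer versions also expose bf16/tf32 flags in TrainingArguments)
```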
if the task is not related to “summarization” then it'll probably mess things up or slow down convergence, because the model will think it's doing summarization due to the prefix. Better to remove it.
Just reiterating what I've said above. If your task is completely new and not related to one of the tasks on which T5 was trained, then the prefix shouldn't matter.