I’m trying to train a T5-based LM-head model (mrm8488/t5-base-finetuned-wikiSQL) on my custom data to turn text into SQL (based roughly on the SPIDER dataset).
The current training loop I have is something like this:
```python
parameters = self.model.parameters()
optimizer = AdamW(parameters, lr=1e-5)  # imported from `transformers`
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=5,
    num_training_steps=len(data) * nr_epochs,
)

for epoch in range(nr_epochs):
    for batch in data_loader:
        optimizer.zero_grad()
        predictions = self.model(**batch)
        loss = predictions[0]
        loss.backward()
        optimizer.step()
        scheduler.step()
```
Note: this is simplified; I don’t show early stopping, datasource creation, data-loader creation, some custom scheduling logic, etc. But none of that should be relevant.
Pretty standard. The `batch` dictionary contains `input_ids`, `attention_mask`, `labels`, and `decoder_attention_mask`. I get the `input_ids` and `attention_mask` from tokenizing my input text, and I get the `labels` and `decoder_attention_mask` from tokenizing my target text (with the same tokenizer).
I also tried passing `decoder_input_ids` (using the same values I used for `labels`), but it results in a CUDA error (when using GPU) or a BLAS error (when using CPU). I tried deep-copying the tensor in case the issue was both it and `labels` pointing to the same object; nothing changed.
My main question here is:
Why would this result in the yielded loss suddenly becoming `nan`, and the model, if `.backward()` is called on that loss, suddenly starting to predict everything as `<pad>`?
Is it just that `<pad>` is what the tokenizer decodes when the model predicts “gibberish” (i.e. `nan`, `inf`, or a very high or low number that isn’t associated with any char/seq by the tokenizer)?
Furthermore, usually, losses seem to become `nan` after they start getting higher and higher, but in this case the model seems to be improving until, at one point, a `nan` drops out of nowhere.
My other questions, to hopefully help address this, are:
- Is the `decoder_attention_mask` actually the `output_attention_mask`? The model seems to perform much better when I add it, and I get it from tokenizing the target text (it seems to overlap with the padding therein) … but my impression was that the “decoder” here was the generator of embeddings and that seq2seq models have an additional LM head. Am I just getting my terminology wrong? Is the argument just named poorly?
- Is there any relevance to passing `decoder_input_ids`? Should these just be equivalent to the `labels` (given that, see above, the “decoder” here seems to be referring to the LM head)? Should I consider passing them instead of passing `labels`? Why would I get CUDA/BLAS-related crashes when I do pass them?
- My current approach is to just “ignore” a loss of `nan`, i.e. clear the gradient, don’t do backprop, and keep moving. Is there a better alternative? Is the loss going to `nan` unexpected, and maybe a sign that I should look for and remove a “faulty” datapoint from the batch?
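The skip-on-`nan` idea from the last bullet can be made explicit with a small guard. This is only a sketch of the check itself; the helper name is mine, and the commented usage follows the training loop above:

```python
import math


def is_bad_loss(loss_value: float) -> bool:
    """True when the loss is NaN or infinite and the update should be skipped.

    Backpropagating a non-finite loss poisons every parameter the gradient
    touches, which matches the observed "everything becomes <pad>" collapse.
    """
    return not math.isfinite(loss_value)


# Inside the training loop it would be used roughly like:
#     loss = predictions[0]
#     if is_bad_loss(loss.item()):
#         optimizer.zero_grad()  # drop this batch, keep training
#         continue
#     loss.backward()
```

Logging the offending batch before skipping it would also let you check whether the same datapoint triggers the `nan` every time, which speaks to the “faulty datapoint” hypothesis.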
I get that this is not an ideal way to be training, but I couldn’t get the Seq2Seq trainer working (I made a question regarding that here: Extremely confusing or non-existent documentation about the Seq2Seq trainer).