Do you train all layers when fine-tuning T5?

From my beginner-level understanding, when it comes to BERT, sometimes people will train just the last layer, or sometimes they’ll train all layers for a couple of epochs and then the last layer for a few more epochs.

Does T5 have any similar practices? Or is it normal to just train the whole thing when fine-tuning?

And very tangentially related: to fine-tune T5, we just call loss.backward() on the loss returned by the forward() call of the T5 model (the loss key in the returned dict), right? So there’s no need to calculate any loss on our own?

I haven’t seen many experiments on this, but IMO it’s better to fine-tune the whole model.

Also, when you pass the labels argument to T5ForConditionalGeneration's forward method, it calculates the loss for you and returns it as the first value of the returned output, which is also exposed under the loss key.

You can also use the script here to fine-tune T5 and other seq2seq models.

See this thread: T5 Finetuning Tips


I am trying to fine-tune T5 for conditional text generation, and it works perfectly. But I would like to have more control over the layers I am training, i.e. to freeze and unfreeze certain layers. For example, I am also using GPT2 for the same use case and I do something like this:

# Freeze all layers except the last n transformer blocks.
# (model is a GPT-2 LM head model; config is a plain dict of settings.)
for parameter in model.parameters():
    parameter.requires_grad = False

num_blocks = len(model.transformer.h)
for i, m in enumerate(model.transformer.h):
    # Only un-freeze the last n transformer blocks
    if i >= num_blocks - config['UNFREEZE_LAST_N']:
        for parameter in m.parameters():
            parameter.requires_grad = True

# Un-freeze the final layer norm
for parameter in model.transformer.ln_f.parameters():
    parameter.requires_grad = True

# Un-freeze the LM head
for parameter in model.lm_head.parameters():
    parameter.requires_grad = True

Now, I am able to freeze all the layers in T5, but I have trouble unfreezing the last n layers, since T5ForConditionalGeneration has no attribute "transformer". Does anyone have an idea how to achieve this? I couldn’t find anything about this issue.

Maybe start this as a fresh topic with a more specific title to your issue - likely to stand more chance of a response? (sorry can’t help, just my experience of this forum!).

You can print out any PyTorch model to see its architecture, or you can simply look at the source code. At the highest level you’ll find the shared embeddings (shared), an encoder (encoder), a decoder (decoder) and an LM head (lm_head). You can use any of those. So for instance, unfreezing the decoder transformer layers:

num_blocks = len(model.decoder.block)  # e.g. 6 for t5-small, 12 for t5-base
for i, m in enumerate(model.decoder.block):
    # Only un-freeze the last n transformer blocks in the decoder
    if i >= num_blocks - config['UNFREEZE_LAST_N']:
        for parameter in m.parameters():
            parameter.requires_grad = True

It is block because the decoder is a T5Stack, which holds its transformer layers in a ModuleList called block.
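Putting the freeze and unfreeze steps together for T5 could look roughly like the sketch below. The helper name unfreeze_last_n_decoder_blocks and the tiny random config are assumptions made for illustration (the config avoids any download); the attribute names decoder.block and decoder.final_layer_norm come from the T5 implementation in transformers.

```python
from transformers import T5Config, T5ForConditionalGeneration

def unfreeze_last_n_decoder_blocks(model, n):
    """Hypothetical helper: freeze everything, then re-enable the last n decoder blocks."""
    for param in model.parameters():
        param.requires_grad = False

    blocks = model.decoder.block
    first_trainable = max(0, len(blocks) - n)  # guard against n > number of blocks
    for block in blocks[first_trainable:]:
        for param in block.parameters():
            param.requires_grad = True

    # Rough counterpart of GPT-2's ln_f: the layer norm applied after the last block.
    for param in model.decoder.final_layer_norm.parameters():
        param.requires_grad = True

# Tiny random model; the sizes are illustrative assumptions, not a real checkpoint.
config = T5Config(vocab_size=100, d_model=32, d_kv=8, d_ff=64,
                  num_layers=4, num_heads=4)
model = T5ForConditionalGeneration(config)
unfreeze_last_n_decoder_blocks(model, n=2)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")
```

The sketch deliberately leaves lm_head frozen. In checkpoints that tie the output projection to the shared input embeddings, unfreezing lm_head would also unfreeze those embeddings, so it is worth treating that as a separate decision.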

So it isn’t hard at all; you just have to look for the right name in the code.
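If you would rather not read the source, one way to find those names is to list the submodules directly. This is a small sketch on a randomly initialised model (the config sizes are arbitrary assumptions); the same calls work on a model loaded with from_pretrained.

```python
from transformers import T5Config, T5ForConditionalGeneration

# Tiny random model; the sizes are illustrative assumptions, not a real checkpoint.
config = T5Config(vocab_size=100, d_model=32, d_kv=8, d_ff=64,
                  num_layers=2, num_heads=4)
model = T5ForConditionalGeneration(config)

# Top-level submodules and their classes.
top_level = {name: type(child).__name__ for name, child in model.named_children()}
for name, cls in top_level.items():
    print(f"{name}: {cls}")

# One level down: the pieces of the decoder stack, including the block list.
decoder_children = [name for name, _ in model.decoder.named_children()]
print(decoder_children)
```

Any name printed here can be used as an attribute path, for example model.decoder.block, when selecting parameters to freeze or unfreeze.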


Thank you, this was actually very helpful :)
