Do you train all layers when fine-tuning T5?

hyura · September 8, 2020, 4:04am

From my beginner-level understanding, when it comes to BERT, sometimes people will train just the last layer, or sometimes they’ll train all layers a couple epochs and then the last layer a few more epochs.

Does T5 have any similar practices? Or is it normal to just train the whole thing when fine-tuning?

And very tangentially related: to fine-tune T5, we just do loss.backward() on the result of the forward() call of the T5 model right (the loss key in the returned dict)? So there’s no need to calculate any loss on our own?

valhalla · September 8, 2020, 6:04am

I haven’t seen many experiments on this, but IMO it’s better to fine-tune the whole model.

Also, when you pass the labels argument to T5ForConditionalGeneration's forward method, it calculates the loss for you and returns it as the first value in the returned tuple.
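A minimal sketch of one training step, to show what that looks like in practice. It assumes a tiny, randomly initialised T5 (the config sizes, dummy token ids and learning rate are made up for illustration so it runs offline); with a real checkpoint you would load the model with from_pretrained and build input_ids and labels with the tokenizer.

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

# Tiny random T5 so the sketch runs without downloading weights.
# All sizes here are illustrative assumptions, not a real checkpoint.
config = T5Config(
    vocab_size=100, d_model=32, d_kv=8, d_ff=64,
    num_layers=2, num_heads=4,
    decoder_start_token_id=0,  # T5 starts decoding from the pad token (id 0)
)
model = T5ForConditionalGeneration(config)
model.train()

# Dummy token ids standing in for tokenizer output.
input_ids = torch.randint(2, 100, (4, 10))
labels = torch.randint(2, 100, (4, 6))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Passing labels makes the model build the decoder inputs and compute
# the cross-entropy loss itself.
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss  # also available as outputs[0]

loss.backward()
optimizer.step()
optimizer.zero_grad()
```

With real data, padded label positions are usually set to -100 so that the loss ignores them.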

And you can use the finetune.py script here to fine-tune T5 and other seq2seq models.

See this thread T5 Finetuning Tips

ivankrstev7 · September 29, 2021, 12:23pm

Hi,
I am trying to fine-tune T5 for conditional text generation, and it works perfectly. But I would like more control over which layers I train, i.e. to freeze and unfreeze certain layers. For example, I am also using GPT2 for the same use case, and there I do something like this:

# Freeze all layers except the last n transformer blocks
for parameter in model.parameters():
    parameter.requires_grad = False

num_blocks = len(model.transformer.h)  # 12 for the base GPT-2 model
for i, m in enumerate(model.transformer.h):
    # Only unfreeze the last n transformer blocks
    if i + 1 > num_blocks - config['UNFREEZE_LAST_N']:
        for parameter in m.parameters():
            parameter.requires_grad = True

# Also unfreeze the final layer norm and the LM head
for parameter in model.transformer.ln_f.parameters():
    parameter.requires_grad = True

for parameter in model.lm_head.parameters():
    parameter.requires_grad = True

Now, I am able to freeze all the layers in T5, but I have trouble unfreezing the last n layers, since T5ForConditionalGeneration has no attribute “transformer”. Does anyone have an idea how to achieve this? I couldn’t find anything about this issue.

TheLongSentance · September 30, 2021, 3:34pm

Maybe start this as a fresh topic with a more specific title to your issue - likely to stand more chance of a response? (sorry can’t help, just my experience of this forum!).

BramVanroy · October 1, 2021, 7:58am

You can print out any PyTorch model to see its architecture, or you can simply look at the source code. On the highest level you’ll find the shared embeddings (shared), an encoder (encoder), a decoder (decoder) and an LM head (lm_head). You can use any of those. So for instance, unfreezing the decoder transformer layers:

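num_blocks = len(model.decoder.block)  # avoids hard-coding 12, which only fits some model sizes
for i, m in enumerate(model.decoder.block):
    # Only unfreeze the last n transformer blocks in the decoder
    if i + 1 > num_blocks - config['UNFREEZE_LAST_N']:
        for parameter in m.parameters():
            parameter.requires_grad = True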

It is block because the decoder is a T5Stack which has its transformer layers as a ModuleList called block.
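If you would rather check those names at runtime than read the source, something like the following sketch should work. It builds a tiny, randomly initialised T5 (the config sizes are arbitrary assumptions, chosen only so it runs offline) and lists the top-level submodules and the type of the decoder's block container.

```python
from transformers import T5Config, T5ForConditionalGeneration

# Tiny random model; the sizes are illustrative assumptions.
config = T5Config(vocab_size=100, d_model=32, d_kv=8, d_ff=64,
                  num_layers=2, num_heads=4)
model = T5ForConditionalGeneration(config)

# Names of the direct submodules of the model.
top_level = [name for name, _ in model.named_children()]
print(top_level)

# The decoder is a T5Stack; its transformer layers live in .block.
decoder_type = type(model.decoder).__name__
block_type = type(model.decoder.block).__name__
print(decoder_type, block_type, len(model.decoder.block))
```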

github.com

huggingface/transformers/blob/8bbb53e20b7873ba7f63be70d4d798e0c3568bfa/src/transformers/models/t5/modeling_t5.py#L811

class T5Stack(T5PreTrainedModel):
    def __init__(self, config, embed_tokens=None):
        super().__init__(config)

        self.embed_tokens = embed_tokens
        self.is_decoder = config.is_decoder

        self.block = nn.ModuleList(
            [T5Block(config, has_relative_attention_bias=bool(i == 0)) for i in range(config.num_layers)]
        )
        self.final_layer_norm = T5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

        self.init_weights()
        # Model parallel
        self.model_parallel = False
        self.device_map = None
        self.gradient_checkpointing = False

So it isn’t hard at all, you just have to go look for the right name in the code.
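Putting the pieces together, here is a rough T5 counterpart of the GPT2 snippet from the question. The helper name unfreeze_last_n_decoder_blocks is made up for this sketch, and the tiny random config is an assumption so that it runs offline; the attribute names (decoder.block, decoder.final_layer_norm, lm_head) are the ones from the T5 source quoted above.

```python
from transformers import T5Config, T5ForConditionalGeneration


def unfreeze_last_n_decoder_blocks(model, n):
    """Hypothetical helper: freeze everything, then unfreeze the last n
    decoder blocks, the decoder's final layer norm and the LM head."""
    for p in model.parameters():
        p.requires_grad = False

    blocks = list(model.decoder.block)
    start = max(len(blocks) - n, 0)  # n = 0 leaves every block frozen
    for block in blocks[start:]:
        for p in block.parameters():
            p.requires_grad = True

    for p in model.decoder.final_layer_norm.parameters():
        p.requires_grad = True
    for p in model.lm_head.parameters():
        p.requires_grad = True


# Tiny random model; the sizes are illustrative assumptions.
config = T5Config(vocab_size=100, d_model=32, d_kv=8, d_ff=64,
                  num_layers=3, num_heads=4)
model = T5ForConditionalGeneration(config)
unfreeze_last_n_decoder_blocks(model, n=1)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")
```

One thing to keep in mind: when the model ties its input and output embeddings (config.tie_word_embeddings), lm_head and shared use the same weight tensor, so unfreezing lm_head also makes the shared embeddings trainable. Skip the lm_head loop if you want the embeddings to stay frozen.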

ivankrstev7 · October 1, 2021, 11:56am

Thank you, this was actually very helpful

VGan · September 26, 2023, 7:48am

Do you guys know what the default behavior is? Does finetuning unfreeze all layers and weights? Or does it just unfreeze the LM head (linear projection)?

dblakely · September 26, 2023, 10:45am

The default behavior in Hugging Face is to fine-tune the whole model; there’s no freezing. If you want to freeze layers, you have to write some code yourself to do that.
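A quick way to see this for yourself, sketched with a tiny randomly initialised T5 (the config sizes are assumptions so it runs offline, and freezing the encoder blocks is only an example of "some code yourself"):

```python
import torch
from transformers import T5Config, T5ForConditionalGeneration

# Tiny random model; the sizes are illustrative assumptions.
config = T5Config(vocab_size=100, d_model=32, d_kv=8, d_ff=64,
                  num_layers=2, num_heads=4)
model = T5ForConditionalGeneration(config)

# Nothing is frozen by default: every parameter requires gradients.
all_trainable = all(p.requires_grad for p in model.parameters())
print("all parameters trainable by default:", all_trainable)

# Freezing is opt-in. As an example, freeze the encoder blocks.
for p in model.encoder.block.parameters():
    p.requires_grad = False

# Hand only the still-trainable parameters to the optimizer.
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
```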
