Fine-tune T5 model for Causal Language Modeling (CLM)

Dear all,
I am new to NLP and have some questions that may sound strange; I will try to explain them clearly.

My goal is to fine-tune the t5-base model on a specific corpus with a causal language modeling objective. I found this document, and it uses AutoModelForCausalLM, but that auto class simply does not support the T5 family of models.

So my question is:

  1. How should I fine-tune a T5 model for the CLM objective? In my understanding, CLM is the process of predicting token_2 from token_1, then token_3 from token_1, token_2, and so on until the end of the input sequence, so I am confused about how to implement this process myself.
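To make the objective concrete, here is a minimal pure-Python sketch (not tied to any library) of the prediction targets that CLM defines, where `clm_pairs` is a hypothetical helper name:

```python
def clm_pairs(tokens):
    """For each position i, the model must predict token i+1
    from the prefix tokens[0..i] (next-token prediction)."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

# context ["t1"] -> target "t2", context ["t1", "t2"] -> target "t3", ...
pairs = clm_pairs(["t1", "t2", "t3", "t4"])
```

In practice, decoder-only models compute the loss for all of these prefixes in a single forward pass using a causal attention mask, rather than materializing each prefix separately.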

  2. I tried to split one of my training examples into something like this (ti == token_i, 1 == eos_token):
    input_ids                                                     labels

  • [t1, 1, 1, 1, 1, 1, ...]         [t1, t2, 1, 1, 1, 1, ...]

  • [t1, t2, 1, 1, 1, 1, ...]        [t1, t2, t3, 1, 1, 1, ...]

  • [t1, t2, t3, 1, 1, 1, ...]       [t1, t2, t3, t4, 1, 1, ...]

  • [t1, t2, t3, t4, 1, 1, ...]      [t1, t2, t3, t4, t5, 1, ...]
    The first problem is obvious: the expanded dataset is too large and requires much more time to fine-tune. The second problem is that this scheme seems strange, and I don't know whether it actually fulfills the CLM objective. This is the only idea I could come up with to solve this problem; does it work?
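The expansion described above can be sketched as follows. This is only an illustration of the proposed scheme, with a hypothetical helper name `expand_example`; it pads with the eos token id (1), mirroring the table, and is where the dataset blow-up comes from, since a sequence of n tokens yields n-1 rows:

```python
def expand_example(tokens, pad_to, eos_id=1):
    """Turn one token sequence into (input_ids, labels) rows,
    one row per prefix length, padded to a fixed length with eos."""
    rows = []
    for i in range(1, len(tokens)):
        input_ids = tokens[:i] + [eos_id] * (pad_to - i)
        labels = tokens[: i + 1] + [eos_id] * (pad_to - i - 1)
        rows.append((input_ids, labels))
    return rows

# e.g. tokens [t1..t5] with pad_to=7 reproduces the four rows in the table
```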


As a supplement, I loaded the model with `T5ForConditionalGeneration.from_pretrained("t5-base")`.