Hi, I have a specific task for which I’d like to use T5.
Inputs look like
some words <SPECIAL_TOKEN1> some other words <SPECIAL_TOKEN2>
The training outputs are a certain combination of the (some words) and the (some other words). The goal is for T5 to learn the composition function that maps the inputs to the outputs, where the output should ideally be good language.
I was hoping to find an example script that I could modify. In particular I need a little help understanding how to do these parts:
When generating the input files (i.e. the mapping from input_str to output_str), what is the best format (e.g. a TSV for inputs and a TSV for outputs with a 1:1 mapping by line)?
Add special tokens to the vocab. Assuming my input files contain these special tokens, then to make the model recognize them, I think I should use something like transformers.T5Tokenizer(additional_special_tokens=['<SPECIAL_TOKEN1>', '<SPECIAL_TOKEN2>']). Is this correct?
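Concretely, something like this is what I have in mind (a rough sketch; the checkpoint name is just an example, and I’m not sure whether the resize_token_embeddings call is strictly required):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained(
    "t5-small",
    additional_special_tokens=["<SPECIAL_TOKEN1>", "<SPECIAL_TOKEN2>"],
)
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# grow the embedding matrix so the new special token ids have embeddings
model.resize_token_embeddings(len(tokenizer))
```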
Additional input processing: I think I need to somehow prepend a new “task tag” to all the input-output pairs. Where would I specify this new task name?
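For reference, this is roughly how I picture the input files and the prefix being handled (the single two-column TSV layout and the “compose:” task name are just my guesses):

```python
import csv

TASK_PREFIX = "compose: "  # hypothetical task name, prepended to every input

def read_pairs(path):
    """Yield (input_str, output_str) pairs from a two-column TSV, one example per line."""
    with open(path, newline="", encoding="utf-8") as f:
        for input_str, output_str in csv.reader(f, delimiter="\t"):
            yield TASK_PREFIX + input_str, output_str
```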
Do I need to register this task somewhere so that it can actually be executed? Some of the examples I saw seem to suggest that I do. And do I need to choose a loss function for my new task? (If I don’t, will one be selected automatically?)
Any tips for the loss function? I care about the outputs being syntactic/grammatical, but I would also like the model to learn the relative positional relations of the inputs.
For example, if the input were something like a b c, the model might learn that abc, bac, cab, or cba are valid outputs (i.e. in this case “a” and “b” must always be adjacent), and would choose the sequence that is most probable under the language model.
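On the loss question, my understanding is that T5ForConditionalGeneration already computes a token-level cross-entropy loss when labels are passed, so if I don’t pick one myself something like this (reusing the tokenizer and model from the snippet above) should just work:

```python
batch = tokenizer(
    ["compose: a b <SPECIAL_TOKEN1> c <SPECIAL_TOKEN2>"],
    return_tensors="pt",
    padding=True,
)
targets = tokenizer(["abc"], return_tensors="pt", padding=True)

# note: for padded batches, pad positions in the labels should be set to -100
# so they are ignored by the loss
outputs = model(
    input_ids=batch.input_ids,
    attention_mask=batch.attention_mask,
    labels=targets.input_ids,
)
print(outputs.loss)  # standard cross-entropy over the target tokens
```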
What would be a good reason for not re-using finetune.py? Or, in the case that someone wants more customization, would it make sense to just tweak the finetune.py file?
Yesterday it took me a while to find the directory examples/seq2seq. I’m curious: why aren’t the modules (e.g. SummarizationModule) available in a higher-level transformers directory (e.g. something like transformers/seq2seq or transformers/LMgeneration)?
You should add max_length=None to your model.generate() call, I think. If that doesn’t work, try max_length=500 or something and see if generations are longer. I think you should also set min_length=None.
So if you want to see what the model is being loaded with when we do .from_pretrained(), call print(model.config). I think we’ll see that the default is max_length=20, which would be causing your problem. Set both max_length and min_length to None, and then the model will stop only when the EOS token is the most probable output.
I think you could also directly modify some of these config parameters right after loading, e.g. via model.config.max_length = new_value, rather than passing them at the generation call.
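For example, something along these lines (untested; the checkpoint name and the 512 cap are just placeholders):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

print(model.config)  # check the generation defaults the checkpoint ships with, e.g. max_length=20

inputs = tokenizer("summarize: some long input text", return_tensors="pt")

# option 1: override the length limits for this call only
output_ids = model.generate(**inputs, max_length=512, min_length=0)

# option 2: change the defaults once, right after loading
model.config.max_length = 512
output_ids = model.generate(**inputs)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```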
Hey all, I have been trying to finetune T5 on XSum and I am getting a constant validation loss. It doesn’t change at all. The training loss varies a bit but doesn’t converge; it stays in the range [10.0, 12.0]. I tried many approaches, like creating my own nn.Module that is compatible with Trainer(), etc., but none worked. Link to colab (first version where I used the default Trainer()).
Can anyone share a colab link or wandb project for my reference?
I am finetuning a T5 model for QA on my dataset, but the vocab is so different from the tokenizer’s that I end up with excessively long token_ids/tokens. Can I train a new tokenizer from the existing one and use it for finetuning? If yes, any tips/resources would help.
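Something like this is what I had in mind (not verified; I’m also unsure whether a new vocab defeats the point of starting from the pretrained checkpoint, since the pretrained embeddings would no longer match the new token ids):

```python
from transformers import AutoTokenizer

# fast tokenizers expose train_new_from_iterator
old_tokenizer = AutoTokenizer.from_pretrained("t5-base", use_fast=True)

texts = ["example question from my domain ...", "another example ..."]  # placeholder corpus

def corpus_iterator(texts, batch_size=1000):
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]

new_tokenizer = old_tokenizer.train_new_from_iterator(
    corpus_iterator(texts), vocab_size=32000
)
new_tokenizer.save_pretrained("my-t5-qa-tokenizer")
```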