Hi, I have a specific task for which I’d like to use T5.
Inputs look like
some words <SPECIAL_TOKEN1> some other words <SPECIAL_TOKEN2>
Training outputs are a certain combination of (some words) and (some other words). The goal is to have T5 learn the composition function that takes the inputs to the outputs, where the output should hopefully be good language.
I was hoping to find an example script that I could modify. In particular I need a little help understanding how to do these parts:
When generating the input files (i.e. the mapping from input_str to output_str), what is the best format (e.g. a TSV for input and a TSV for output with a 1:1 mapping by line)?
Adding special tokens to the vocab: assuming that my inputs contain special tokens in the input files, then to make the model recognize them, I think I should use something like transformers.T5Tokenizer(additional_special_tokens=['<SPECIAL_TOKEN1>', '<SPECIAL_TOKEN2>']). Is this correct?
Additional input processing: I think I need to somehow prepend a new “task tag” to all the input-output pairs. Where would I specify this new task name?
Do I need to register this task somewhere so that it can actually be executed? Some of the examples I saw seem to suggest that I do. And do I need to choose a loss function for my new task? (If I don’t, will one be selected automatically?)
Any tips for the loss function? I care about the outputs being syntactic/grammatical, but I would also like the model to learn the relative positional relations of the inputs.
For example, if I had something like
a b c, the model might learn that abc, bac, cab, or cba are valid (i.e. in this case “a” and “b” must always be adjacent), and would choose the sequence that is most probable under the language model.
You can choose whatever format works well for you; the only thing to note is that your dataset or collator should return input_ids, attention_mask, and labels.
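For example, here is a minimal sketch of a PyTorch dataset that reads a TSV and returns those three fields; the file layout, task prefix, and max length below are just placeholders, not something the library requires:

from torch.utils.data import Dataset

class PairDataset(Dataset):
    # Assumes one "input<TAB>output" pair per line of a TSV file.
    def __init__(self, tsv_path, tokenizer, prefix="my_task: ", max_len=512):
        with open(tsv_path, encoding="utf-8") as f:
            self.pairs = [line.rstrip("\n").split("\t") for line in f]
        self.tokenizer = tokenizer
        self.prefix = prefix  # optional task prefix prepended to every input
        self.max_len = max_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src, tgt = self.pairs[idx]
        enc = self.tokenizer(self.prefix + src, truncation=True, max_length=self.max_len)
        lbl = self.tokenizer(tgt, truncation=True, max_length=self.max_len)
        return {
            "input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
            "labels": lbl["input_ids"],
        }

A collator such as DataCollatorForSeq2Seq can then take care of padding and batching these into tensors.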
To add new tokens
tokenizer.add_tokens(list_of_new_tokens)
# resize the embeddings
model.resize_token_embeddings(len(tokenizer))
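If you want the new markers treated as special tokens that the tokenizer never splits (as in your additional_special_tokens idea), something like this should also work; the model name and token strings are placeholders taken from your example:

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# register the markers so the tokenizer keeps them as single tokens
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<SPECIAL_TOKEN1>", "<SPECIAL_TOKEN2>"]}
)
# resize the embeddings to match the enlarged vocab
model.resize_token_embeddings(len(tokenizer))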
Using a task prefix is optional.
No, you won’t need to register the task; the original T5 repo requires that, but it’s not required here.
What would be a good reason for not re-using finetune.py? Or, in the case that someone wants more customization, would it make sense to just tweak the finetune.py file?
Yesterday it took me a while to find the directory examples/seq2seq. I’m curious: why aren’t the modules (e.g. SummarizationModule) available in a higher-level transformers directory (e.g. something like transformers/seq2seq or transformers/LMgeneration)?
You should add max_length=None to your model.generate() call, I think. If that doesn’t work, try max_length=500 or something and see if generations are longer. I think you should also set min_length=None.
So if you want to see what the model is being loaded with when we do .from_pretrained(), call print(model.config). I think we’ll see that the default is max_length=20, which would be causing your problem. Set both max_length and min_length to None, and then the model will stop only when the EOS token is the most probable output.
edit:
I think you could also directly modify some of these config parameters at load time, e.g. by setting model.config.max_length = new_value, rather than doing it in the generation call.
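A rough sketch of what I mean (model name and prompt are placeholders):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

print(model.config)  # inspect the defaults, e.g. max_length=20

inputs = tokenizer("summarize: some long input text", return_tensors="pt")

# option 1: override at generation time
outputs = model.generate(**inputs, max_length=500)

# option 2: change the loaded config once, before generating
model.config.max_length = 512
outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))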
Hey all, I have been trying to finetune T5 on XSum and I am getting a constant validation loss. It doesn’t change at all. The training loss varies a bit but doesn’t converge; it stays in the range [10.0, 12.0]. I tried many things, like creating my own nn.Module that is compatible with Trainer(), etc., but none worked. Link to colab (first version, where I used the default Trainer()).
Can anyone share a colab link or wandb project for my reference?
Hi,
I am finetuning a T5 model for QA on my dataset, but the vocab is very different from the tokenizer’s, which results in an excessive number of token_ids/tokens. Can I train a new tokenizer from the existing one and use it for finetuning? If yes, any tips/resources to help?
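For context, something like this is what I had in mind, if it is a sensible approach (model name, corpus, and vocab size are placeholders, and I believe this needs the fast tokenizer); I am also unsure how much of the pretrained embeddings would still carry over with a freshly trained vocab:

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("t5-base")

# placeholder corpus; in practice this would iterate over my own QA texts
my_texts = ["example document one", "example document two"]

def corpus_iterator(batch_size=1000):
    for i in range(0, len(my_texts), batch_size):
        yield my_texts[i:i + batch_size]

new_tokenizer = old_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=32000)
new_tokenizer.save_pretrained("my-domain-t5-tokenizer")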
Thanks
My task is this: I have some challenges, and for each challenge I have a related solution. I want to do solution generation, meaning the input would be a challenge and the output would be a solution. I applied GPT-Neo, but the results were not very good. Do you think T5 conditional generation could be a good try? I am not familiar with what “conditional” means here. Thanks for your guidance. I am thinking about adding some condition but don’t know if T5 is a good option. What is the most common application of T5?
I am also a beginner, starting to use T5 for (non-extractive) question answering on my own dataset. I have it loaded into a dataset with two keys: “inputs”, which make up the context and questions, and “targets”, which are the answers. Do I need a labels and index section? Also, I am a little unclear on where to go from here; any advice or tips would be greatly appreciated.
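For reference, this is roughly how I am imagining the preprocessing step, with the model name and max lengths being guesses on my part:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")

def preprocess(example):
    # tokenize the "inputs" (context + question) and "targets" (answer) columns
    model_inputs = tokenizer(example["inputs"], max_length=512, truncation=True)
    targets = tokenizer(example["targets"], max_length=64, truncation=True)
    model_inputs["labels"] = targets["input_ids"]
    return model_inputs

# tokenized = my_dataset.map(preprocess)  # my_dataset holds the "inputs"/"targets" keys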