I want to do a multilingual token regression task where the target value depends on an external language that is not part of the input. In other words, I want to model properties between languages while only having access to the original source text. For instance, if I have an English source text and I want to predict a value for each token that quantifies how that token relates to French, then the result should be different from when it relates to German.
T5 seems like a good candidate here, since it was already pretrained on translation tasks with prefixes like “translate English to German: ”, so a lot of the relevant information should already be in the model. I have some questions about this:
- Are these prefixes special tokens that the tokenizer leaves untouched (like `<s>` etc.), or can they be any string, which is then tokenised normally? Is there anything “special” about the prefix?
- If there is nothing special about the string, can I assume that if I change the prefix from “translate French to German” to “French to German”, that part of the pretrained model is still taken into account (since it “recognises” the languages at the start), or would the words really need to be in the same position?
If they have to be in the same position, can I just use them as they were pretrained (“translate French to German”) and simply add a regression head instead of the LM head? (see edit below)
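For context on the first question, here is a small sketch (assuming the Hugging Face `transformers` library and the `t5-small` checkpoint) showing that the task prefix is ordinary text: it goes through the SentencePiece tokenizer like any other string and produces regular subword pieces, not reserved special tokens.

```python
# Sketch: the T5 task prefix is plain text, tokenised like anything else.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")

pieces = tok.tokenize("translate English to German: That is good.")
print(pieces)  # ordinary SentencePiece pieces such as '▁translate', '▁English', ...

# None of the prefix pieces are special tokens (</s>, <pad>, <unk>, sentinels):
special = set(tok.all_special_tokens)
assert not any(p in special for p in pieces)
```

Since nothing marks the prefix as special, the model only “knows” about it through what it saw during pretraining, which is exactly what the second question hinges on.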
Thanks a lot for your time! I really need to dig further into the models that were introduced after RoBERTa but, you know how it goes, life got in the way. So it’s nice that there’s a place here to ask some questions!
EDIT: of course, T5 is text-to-text, so I should not add a specific regression head. I’ll have to dig deeper into how you evaluate on a regression task then, though. It seems very counter-intuitive to evaluate a regression model as a generation model. So if you have more information on using T5 for token regression, that’s welcome as well.
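If you do decide to deviate from the pure text-to-text recipe, one option is to keep only the encoder and put a token-level regression head on top. A minimal sketch, assuming `transformers`’ `T5EncoderModel`; the class name `T5TokenRegressor` and the tiny randomly initialised config are my own illustrative choices (in practice you would load pretrained weights, e.g. `T5EncoderModel.from_pretrained("t5-small")`):

```python
# Sketch: token-level regression on top of T5's encoder (no decoder, no LM head).
import torch
from transformers import T5Config, T5EncoderModel


class T5TokenRegressor(torch.nn.Module):  # hypothetical name
    def __init__(self, config: T5Config):
        super().__init__()
        self.encoder = T5EncoderModel(config)
        self.head = torch.nn.Linear(config.d_model, 1)  # one scalar per token

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                    # (batch, seq_len, d_model)
        return self.head(hidden).squeeze(-1)   # (batch, seq_len)


# Tiny random config so the sketch runs without downloading weights.
config = T5Config(vocab_size=100, d_model=32, d_ff=64, d_kv=8,
                  num_layers=2, num_heads=4)
model = T5TokenRegressor(config)

ids = torch.randint(0, 100, (2, 7))  # stands in for "prefix + source text" ids
scores = model(ids)                  # one regression value per input token
print(scores.shape)                  # torch.Size([2, 7])
loss = torch.nn.functional.mse_loss(scores, torch.zeros_like(scores))
```

With this setup you evaluate like any regression model (e.g. MSE or Pearson correlation per token) instead of as a generator; the language conditioning would then come purely from the textual prefix in the input.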