Hi there
I want to do a multilingual token regression task where the predicted value depends on an external language that is not part of the input text. So basically I want to model properties between languages while only having access to the original source text. For instance, if I have an English source text and I want to predict a value for each token as a means to quantify how that token relates to French, then the result should be different from when it relates to German.
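To make that concrete, here is a toy sketch of the interface I have in mind (the function name and its body are made up; it only illustrates the desired input/output shape):

```python
from typing import List

def predict(source_text: str, target_lang: str) -> List[float]:
    """Hypothetical interface: one real-valued score per source token,
    where the scores depend on the target language."""
    tokens = source_text.split()
    # Placeholder so the sketch runs; a real model would compute these.
    return [0.0 for _ in tokens]

# In a trained model, these two calls should return *different* scores:
print(predict("The cat sat on the mat", target_lang="French"))
print(predict("The cat sat on the mat", target_lang="German"))
```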
T5 seems like a good candidate here since it was already pretrained on the translation task with prefixes like `translate French to German: `, so the model should already contain a lot of relevant information. I have some questions about this:
- Are these prefixes “special tokens” that are not tokenised by the tokenizer (like `<s>` etc.), or can they be any string, which is then tokenised? Is there anything “special” about the prefix? (A quick tokenizer check is sketched after this list.)
- If there is nothing special about the string, can I assume that if I change the prefix from `translate French to German` to `French to German`, that part of the pretrained model is still taken into account (since it “recognises” the languages at the start), or would they really need to be in the same position?
- If they have to be in the same position, can I just use them as they were pretrained (`translate French to German`) and simply add a regression head instead of the LM head? (See edit below.)
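For the first question, this is the check I would run to see how the prefix is handled (assuming the `t5-small` checkpoint; I haven't verified the exact output myself):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("t5-small")

# If the prefix involved reserved tokens, they would show up here; for T5
# these are </s>, <unk>, <pad> plus the <extra_id_*> sentinel tokens.
print(tok.all_special_tokens)

# The prefix itself is plain text and goes through SentencePiece
# like any other string:
print(tok.tokenize("translate French to German: How are you?"))
```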
Thanks a lot for your time! I really need to dig further into the models that were introduced after RoBERTa but, you know how it goes, life got in the way. So it’s nice that there’s a place here to ask some questions!
EDIT: of course T5 is text-to-text, so I should not add a specific regression head. I’ll have to dig deeper into how you evaluate on a regression task then, though. It seems very counter-intuitive to evaluate a regression model as a generation model. So if you have more information on using T5 for token regression, that’s welcome as well.
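For completeness, this is the non-text-to-text variant I originally had in mind, in case someone wants to comment on whether it is viable despite breaking the text-to-text framing: keep only the T5 encoder and put a linear layer on top for per-token scores (a minimal sketch, assuming `T5EncoderModel` from `transformers` and the `t5-small` checkpoint, trained with e.g. an MSE loss):

```python
import torch
from torch import nn
from transformers import AutoTokenizer, T5EncoderModel

class T5TokenRegressor(nn.Module):
    """Encoder-only T5 with a linear head: one score per (sub)token."""

    def __init__(self, name: str = "t5-small"):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(name)
        self.head = nn.Linear(self.encoder.config.d_model, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                    # (batch, seq_len, d_model)
        return self.head(hidden).squeeze(-1)  # (batch, seq_len)

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5TokenRegressor()
# The target language is conveyed through the prefix, as in pretraining:
batch = tok(["translate French to German: The cat sat on the mat"],
            return_tensors="pt")
with torch.no_grad():
    scores = model(**batch)
print(scores.shape)  # torch.Size([1, seq_len])
```

If I stick to the text-to-text framing instead, I suppose the target values would have to be serialised as strings and parsed back for evaluation, which is exactly the part I find counter-intuitive.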