I have no problem training a transformer (like a small T5) to output the last word of certain sentences in an input with many irrelevant sentences. But a problem like the following seems to be much harder for the same model to learn.
My input-output pairs are kind of like these:
input:  irrelevant sentences. n42. i am sentence 1. my name is sentence 2. more irrelevant sentences. out:
output: n42. n42.

input:  irrelevant sentences. x79. i am the real sentence 1. the real deal is sentence 2. you are like sentence 3. more irrelevant sentences. out:
output: x79. x79. x79.
So the model should output a certain word as many times as there are sentences of a certain kind in the input. Would this be hard for a transformer (like T5) to learn? Would it need to learn to "count"? Would an output format like

n42. same.

and

x79. same. same.

be easier to learn? If so, why?
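For anyone who wants to experiment with this, here is a minimal sketch of a generator for both output formats. The filler sentences, target templates, and function name are my own illustrative choices, not the exact data I'm using:

```python
import random

# Illustrative pool of distractors; the real irrelevant sentences are arbitrary.
FILLER = [
    "the sky is blue",
    "dogs bark loudly",
    "rain fell all day",
    "coffee is warm",
]

def make_example(n_targets, repeat_style=False):
    """Build one synthetic input/output pair.

    repeat_style=False -> output repeats the marker n_targets times
                          ("x79. x79. x79.").
    repeat_style=True  -> output is the marker once, then "same" for
                          each additional target ("x79. same. same.").
    """
    marker = random.choice("nx") + str(random.randint(10, 99))
    targets = [f"i am sentence {i + 1}" for i in range(n_targets)]
    body = (
        random.sample(FILLER, 2)    # irrelevant sentences
        + [marker]                  # the word the model must echo
        + targets                   # the sentences it must count
        + random.sample(FILLER, 2)  # more irrelevant sentences
    )
    inp = ". ".join(body) + ". out:"
    if repeat_style:
        out = ". ".join([marker] + ["same"] * (n_targets - 1)) + "."
    else:
        out = ". ".join([marker] * n_targets) + "."
    return inp, out
```

Sampling `n_targets` from a range and fine-tuning on pairs from each format separately would let you compare how quickly the two variants are learned.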
Any insights are welcome.