I have a tokenized dataset where each sample is a chunk of 8192 tokens. A chunk may contain several original dataset samples packed together (if they are short), or a slice of a single long original sample. Some original samples have the format `just a text without anything<END_OF_TEXT>`, while others have the format `beginning of the text<FIX_TOKEN>end of the text<END_OF_TEXT>`. Naturally, samples in both formats can appear inside the same 8192-token chunk. What I want to do: if a sample is in the first format, compute the normalized cross-entropy loss on everything up to the <END_OF_TEXT> token; if it is in the second format, compute the loss only on the tokens from <FIX_TOKEN> to <END_OF_TEXT>.
For example, if the decoded 8192-token chunk is `just a text without anything<END_OF_TEXT>beginning of the text<FIX_TOKEN>end of text<END_OF_TEXT>`, I want to compute the loss from the beginning up to the first <END_OF_TEXT> token, and from <FIX_TOKEN> up to the second <END_OF_TEXT> token. Attention should still see the tokens of "beginning of the text"; they just should not contribute to the loss. Then I want the mean loss over the whole 8192-token chunk, implemented by subclassing the transformers Trainer class. Is this possible?
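One common way to answer my own setup (sketched here, not the only way): keep the attention mask fully on so the prefix tokens are still attended to, and express "no loss here" by setting those positions in the labels to -100, which PyTorch's cross entropy ignores. Everything below is a minimal sketch under assumptions: the token ids `FIX_ID`/`EOS_ID` and the names `build_labels`, `masked_lm_loss`, `MaskedLossTrainer` are hypothetical; the real ids would come from `tokenizer.convert_tokens_to_ids(...)`.

```python
# Sketch: mask prefix tokens with -100 and compute the loss in a custom
# Trainer. FIX_ID / EOS_ID and all helper names are hypothetical; adapt
# them to your tokenizer (tokenizer.convert_tokens_to_ids(...)).
import torch
import torch.nn.functional as F
from transformers import Trainer


def build_labels(input_ids: torch.Tensor, fix_id: int, eos_id: int) -> torch.Tensor:
    """Copy input_ids and set to -100 every position that should not
    contribute to the loss: inside each <END_OF_TEXT>-delimited segment,
    if a <FIX_TOKEN> is present, mask everything up to and including it."""
    labels = input_ids.clone()
    for row in range(input_ids.size(0)):
        ids = input_ids[row]
        ends = (ids == eos_id).nonzero(as_tuple=True)[0].tolist()
        ends.append(ids.size(0) - 1)  # last segment may be cut by the chunk boundary
        start = 0
        for end in ends:
            seg = ids[start:end + 1]
            fix = (seg == fix_id).nonzero(as_tuple=True)[0]
            if fix.numel() > 0:
                # second format: ignore "beginning of the text" + <FIX_TOKEN>
                labels[row, start:start + fix[0].item() + 1] = -100
            start = end + 1
    return labels


def masked_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard causal shift (token t predicts token t+1); -100 labels are
    skipped, so the result is the mean over the unmasked positions only."""
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )


class MaskedLossTrainer(Trainer):
    """Builds the masked labels on the fly. The attention mask is left
    untouched, so the model still attends to the masked prefix tokens."""
    FIX_ID = 32001  # hypothetical id of <FIX_TOKEN>
    EOS_ID = 32000  # hypothetical id of <END_OF_TEXT>

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = build_labels(inputs["input_ids"], self.FIX_ID, self.EOS_ID)
        outputs = model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs.get("attention_mask"),
        )
        loss = masked_lm_loss(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```

Alternatively, the labels could be built once in the data collator and passed to the model directly, since the model's built-in loss also honors -100. One caveat: `cross_entropy` with `ignore_index` averages over all unmasked tokens in the batch, which is not exactly "mean per 8192-token chunk, then mean over chunks"; if that exact normalization matters, compute the loss with `reduction="none"` and average per row manually.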