I am evaluating a LM model on multiple choice questions. I want to dynamically concatenate each choice to the question after tokenization.
A tokenization example:
Q: How many hours are there in a day? —tokenize—> 23 67 52 13 78 [PAD] [PAD] [PAD] [PAD] [PAD]
C1: 12 hours. —tokenize—> 312 [PAD] [PAD]
C2: 24 hours. —tokenize—> 89 [PAD] [PAD]
I’d like to make two question-choice pairs:
23 67 52 13 78 312 [PAD] [PAD] [PAD] [PAD]
23 67 52 13 78 89 [PAD] [PAD] [PAD] [PAD]
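To make the desired operation concrete, here is a minimal sketch of what I want to happen after tokenization (the pad ID 0 and the helper name are just illustrative, not my actual setup):

```python
PAD = 0  # hypothetical pad token id, stands in for [PAD]

def concat_after_tokenization(question_ids, choice_ids, total_len):
    """Drop padding from both sequences, concatenate question + choice,
    then re-pad the result to total_len."""
    q = [t for t in question_ids if t != PAD]
    c = [t for t in choice_ids if t != PAD]
    merged = (q + c)[:total_len]
    return merged + [PAD] * (total_len - len(merged))

question = [23, 67, 52, 13, 78, PAD, PAD, PAD, PAD, PAD]
choices = [[312, PAD, PAD], [89, PAD, PAD]]
pairs = [concat_after_tokenization(question, c, len(question)) for c in choices]
# pairs[0] == [23, 67, 52, 13, 78, 312, 0, 0, 0, 0]
# pairs[1] == [23, 67, 52, 13, 78, 89, 0, 0, 0, 0]
```

This works, but looping in Python like this feels inelegant, which is why I am asking whether there is a better way.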
I currently do this at tokenization time: I concatenate the texts and feed the result to the tokenizer.
But this is inconvenient when, for example, I also want the question and the answers separately. It also feels clumsy, and it causes extra GPU memory consumption and extra disk space to store the tokenized dataset.
Therefore, is there an elegant way to do this concatenation after tokenization?