How to concatenate an answer to multiple choices after padded tokenization

drt · November 15, 2022, 2:52pm

I am evaluating a language model on multiple-choice questions. I want to dynamically concatenate each choice to the question after tokenization.
A tokenization example:
Q: How many hours are there in a day? --tokenize--> 23 67 52 13 78 [PAD] [PAD] [PAD] [PAD] [PAD]
C1: 12 hours. --tokenize--> 312 [PAD] [PAD]
C2: 24 hours. --tokenize--> 89 [PAD] [PAD]
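I’d like to build two question-choice pairs, with the choice tokens placed directly after the question tokens and the padding moved to the end: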
23 67 52 13 78 312 [PAD] [PAD] [PAD] [PAD]
23 67 52 13 78 89 [PAD] [PAD] [PAD] [PAD]

I currently do this at tokenization time: I concatenate the texts and feed the result to the tokenizer.
But this is inconvenient when, for example, I also need the question and the answers separately. It also seems clumsy, and I expect it to cost extra GPU memory and extra disk space for the saved tokenized dataset, since the question tokens are stored once per choice.

Therefore, is there an elegant way to do this concatenation after tokenization?
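
To make the operation concrete, here is a minimal sketch in plain Python of what I mean by concatenating after tokenization. The function name `concat_padded` and the pad id `0` are made up for illustration (a Hugging Face tokenizer exposes the real id as `tokenizer.pad_token_id`), and it assumes right padding and that the pad id never occurs as a real token.

```python
from typing import List, Optional

PAD_ID = 0  # assumption: id of the [PAD] token in this toy example


def concat_padded(
    question: List[int],
    choice: List[int],
    pad_id: int = PAD_ID,
    max_len: Optional[int] = None,
) -> List[int]:
    """Drop padding from both sequences, join them, and re-pad on the right.

    Hypothetical helper: `max_len` defaults to the padded question length,
    which matches the example in this post.
    """
    # Keep only real tokens (assumes pad_id is never a real token id).
    q_tokens = [t for t in question if t != pad_id]
    c_tokens = [t for t in choice if t != pad_id]
    merged = q_tokens + c_tokens

    if max_len is None:
        max_len = len(question)
    if len(merged) > max_len:
        raise ValueError("question + choice does not fit into max_len")

    # Move all padding to the end of the combined sequence.
    return merged + [pad_id] * (max_len - len(merged))


# The example from this post, with [PAD] written as 0.
question = [23, 67, 52, 13, 78, 0, 0, 0, 0, 0]
choices = [[312, 0, 0], [89, 0, 0]]
pairs = [concat_padded(question, c) for c in choices]
```

With this, only the question and the choices need to be stored once each, and the pairs are built on the fly. A batched tensor version would also have to rebuild the attention mask, which this sketch leaves out.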
