Custom Dataset, avoid doubling data (reuse encodings)

Hello

I'm in the process of building a custom dataset for fine-tuning LayoutLM on question answering. Each page can have over 60 questions asked about it, the same questions over and over. As far as I can see from the QA examples, the question and the context are stored as one combined field in all datasets. In my case that seems like a waste of memory and time.

My question: is it a bad idea to write a custom __getitem__(self, idx) in the dataset that combines the encoded question with the encoded context at runtime? This would save memory and, depending on how you do it, also encoding and tokenizing time. But maybe there are unforeseen consequences for batching, multiprocessing, the Apache Arrow implementation, or whatnot.
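
Roughly what I have in mind (a plain torch Dataset just for illustration; all names here are made up):

import torch

# Sketch: cache each unique question encoding once and merge it with the
# page encoding only when an item is requested.
class CombiningQADataset(torch.utils.data.Dataset):
    def __init__(self, question_encodings, pages):
        self.question_encodings = question_encodings  # {question_id: [token ids]}
        self.pages = pages  # list of (question_id, [context token ids]) pairs

    def __len__(self):
        return len(self.pages)

    def __getitem__(self, idx):
        q_id, context_ids = self.pages[idx]
        return {"input_ids": self.question_encodings[q_id] + context_ids}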

Does anyone see the full picture here?

Yes, overriding __getitem__ is not a good idea.

If the total number of questions is small, it's best to use datasets.Sequence(datasets.ClassLabel(names=list_of_all_questions)) to store them as integers (one int64 per question).
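
For example (a minimal sketch; the question strings are placeholders):

from datasets import ClassLabel, Features, Sequence, Value

list_of_all_questions = ["What is the total?", "Who is the vendor?"]  # ~60 in your case
questions = ClassLabel(names=list_of_all_questions)

features = Features({
    "page_id": Value("string"),
    "questions": Sequence(questions),
})

q_id = questions.str2int("Who is the vendor?")  # -> 1, stored as one int64
print(questions.int2str(q_id))                  # -> "Who is the vendor?"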

Thanks for the reply, especially the warning; I won't override __getitem__…

But I don't understand your answer. I thought that the Dataset features for LayoutLMForQuestionAnswering had to be:

features = Features({
    "input_ids": Sequence(Value(dtype="int64")),
    "bbox": Array2D(dtype="int64", shape=(512, 4)),
    "attention_mask": Sequence(Value(dtype="int64")),
    "token_type_ids": Sequence(Value(dtype="int64")),
    "start_positions": Value(dtype="int64"),
    "end_positions": Value(dtype="int64"),
})

Otherwise __getitem__ won't work and the input for training would be wrong? Here input_ids contains both the question and the context, and token_type_ids marks what is question and what is context. Am I totally off here?
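
For reference, this is how a plain BERT-style tokenizer (LayoutLM uses the same scheme; the checkpoint name is just an example) packs a question/context pair:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("What is the total?", "Invoice total 100 EUR")
print(enc["input_ids"])       # [CLS] question [SEP] context [SEP]
print(enc["token_type_ids"])  # 0 over the question segment, 1 over the context
print(enc["attention_mask"])  # 1 for real tokens; 0 only appears with padding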

I misunderstood your initial question.

You can have two datasets, one for the questions (with length 60) and one for the answers (same length as the original dataset), tokenize each, and call set_transform on the answers dataset with a transform that combines the tokenized question-answer pairs.

Thanks, I will try this. If I understand you (and the library) correctly, this is about the same as overriding __getitem__, but there is a system in place for it, which does the combining before __getitem__ is called instead of inside __getitem__, which was my original guess?

It would end up like:

def transform_encode(batch):
    ...  # combine the pre-tokenized question tensors with this batch
    return batch  # a dict with the features from the reply above

answer_ds.set_transform(transform_encode)
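
Fleshed out, a self-contained toy version might look like this (dummy token ids, and bbox handling omitted for brevity):

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id

# The ~60 questions, tokenized exactly once.
questions = ["What is the total?", "Who is the vendor?"]
question_enc = [tokenizer(q, add_special_tokens=False)["input_ids"] for q in questions]

# One row per (page, question) pair, storing only the question id.
answer_ds = Dataset.from_dict({
    "context_ids": [[7592, 2088], [7592, 2088]],  # pre-tokenized page words (dummy ids)
    "question_id": [0, 1],
})

def transform_encode(batch):
    out = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for q_id, ctx in zip(batch["question_id"], batch["context_ids"]):
        q = question_enc[q_id]
        ids = [cls_id] + q + [sep_id] + ctx + [sep_id]
        out["input_ids"].append(ids)
        out["token_type_ids"].append([0] * (len(q) + 2) + [1] * (len(ctx) + 1))
        out["attention_mask"].append([1] * len(ids))
    return out

answer_ds.set_transform(transform_encode)
print(answer_ds[0]["input_ids"])  # question ids followed by the context ids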

Yes, that's correct!
