Custom Dataset, avoid doubling data (reuse encodings)

Hello

I'm in the process of building a custom dataset for fine-tuning LayoutLM on question answering. Each page can have over 60 questions asked about it, the same questions over and over. As far as I can see from the QA examples, the question and the context are stored as one combined field in all datasets. In my case that seems like a waste of memory and time.

My question: is it a bad idea to write a custom __getitem__(self, idx) in the dataset that combines the encoded question with the encoded context at runtime? This would save memory and, depending on how you do it, also encoding and tokenizing time. But maybe there are unforeseen consequences for batching, multiprocessing, the Apache Arrow implementation, or whatnot.
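
Roughly what I have in mind (a plain torch Dataset just for illustration; all names here are made up):

import torch

# Sketch: cache each unique question encoding once and merge it with the
# page encoding only when an item is requested.
class CombiningQADataset(torch.utils.data.Dataset):
    def __init__(self, question_encodings, pages):
        self.question_encodings = question_encodings  # {question_id: [token ids]}
        self.pages = pages  # list of (question_id, [context token ids]) pairs

    def __len__(self):
        return len(self.pages)

    def __getitem__(self, idx):
        q_id, context_ids = self.pages[idx]
        return {"input_ids": self.question_encodings[q_id] + context_ids}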

Does anyone see the full picture here?

Yes, overriding __getitem__ is not a good idea.

If the total number of questions is small, it's best to use datasets.Sequence(datasets.ClassLabel(names=list_of_all_questions)) to store them as integers (one int64 per question).
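
For example (a minimal sketch; the question strings are placeholders):

from datasets import ClassLabel, Features, Sequence, Value

list_of_all_questions = ["What is the total?", "Who is the vendor?"]  # ~60 in your case
questions = ClassLabel(names=list_of_all_questions)

features = Features({
    "page_id": Value("string"),
    "questions": Sequence(questions),
})

q_id = questions.str2int("Who is the vendor?")  # -> 1, stored as one int64
print(questions.int2str(q_id))                  # -> "Who is the vendor?"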

Thanks for the reply, especially the warning; I won't override __getitem__…

But I don't understand your answer. I thought that the Dataset features for LayoutLMForQuestionAnswering had to be:

features = Features({
    "input_ids": Sequence(Value(dtype="int64")),
    "bbox": Array2D(dtype="int64", shape=(512, 4)),
    "attention_mask": Sequence(Value(dtype="int64")),
    "token_type_ids": Sequence(Value(dtype="int64")),
    "start_positions": Value(dtype="int64"),
    "end_positions": Value(dtype="int64"),
})

Otherwise __getitem__ won't work and the input for training would be wrong? Here input_ids contains both the question and the context, and token_type_ids marks what is question and what is context. Am I totally off here?
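
For reference, this is how a plain BERT-style tokenizer (LayoutLM uses the same scheme; the checkpoint name is just an example) packs a question/context pair:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("What is the total?", "Invoice total 100 EUR")
print(enc["input_ids"])       # [CLS] question [SEP] context [SEP]
print(enc["token_type_ids"])  # 0 over the question segment, 1 over the context
print(enc["attention_mask"])  # 1 for real tokens; 0 only appears with padding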

I misunderstood your initial question.

You can have two datasets, one for the questions (with length 60) and one for the answers (same length as the original dataset), tokenize each, and call set_transform on the answers dataset with a transform that combines the tokenized question-answer pairs.

Thanks, I will try this. If I understand you (and the library) correctly, this is about the same as overriding __getitem__, but there is a system in place for it, which does the combining before __getitem__ is called instead of inside __getitem__, which was my original guess?

It would end up like:

def transform_encode(batch):
    ...  # combine the pre-tokenized question tensors with this batch
    return batch  # a dict with the features from the reply above

answer_ds.set_transform(transform_encode)
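
Fleshed out, a self-contained toy version might look like this (dummy token ids, and bbox handling omitted for brevity):

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id

# The ~60 questions, tokenized exactly once.
questions = ["What is the total?", "Who is the vendor?"]
question_enc = [tokenizer(q, add_special_tokens=False)["input_ids"] for q in questions]

# One row per (page, question) pair, storing only the question id.
answer_ds = Dataset.from_dict({
    "context_ids": [[7592, 2088], [7592, 2088]],  # pre-tokenized page words (dummy ids)
    "question_id": [0, 1],
})

def transform_encode(batch):
    out = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for q_id, ctx in zip(batch["question_id"], batch["context_ids"]):
        q = question_enc[q_id]
        ids = [cls_id] + q + [sep_id] + ctx + [sep_id]
        out["input_ids"].append(ids)
        out["token_type_ids"].append([0] * (len(q) + 2) + [1] * (len(ctx) + 1))
        out["attention_mask"].append([1] * len(ids))
    return out

answer_ds.set_transform(transform_encode)
print(answer_ds[0]["input_ids"])  # question ids followed by the context ids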

Yes, that's correct!
