In my experiment, I concatenate one question with each of its 10 possible answers to generate one QA pair per answer, so that a language model can directly evaluate the perplexity of each answer. Therefore, each item in the dataset has an input_ids of shape 10 (n_answers) × 64 (sequence length).
However, when I use set_format('torch'), each item becomes a list of 10 tensors of shape (64,) rather than a single (10, 64) tensor.
Is there a way to make it return a matrix?
The code to reproduce the problem:
import torch
from datasets import load_dataset
from transformers import GPT2Tokenizer

def tokenize_func(examples, tokenizer):
    # Create a question-answer pair for every choice
    choices = examples['choices']
    duplicated_question = [examples['question']] * len(choices)
    tokenized = tokenizer(duplicated_question,
                          choices,
                          padding='max_length',  # batch size has to be 1
                          max_length=64,
                          truncation=True,
                          return_token_type_ids=True,
                          return_tensors='np')
    return tokenized

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

val_dataset = load_dataset('drt/kqa_pro', 'train_val', split='validation[:1%]')
val_dataset = val_dataset.map(tokenize_func, fn_kwargs={'tokenizer': tokenizer})
val_dataset = val_dataset.remove_columns(['question', 'sparql', 'program', 'choices', 'answer'])
val_dataset.set_format('torch')
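For reference, inspecting the first item after set_format shows the list-of-tensors behaviour (output from my run):

item = val_dataset[0]
print(type(item['input_ids']))      # <class 'list'>, not a torch.Tensor
print(item['input_ids'][0].shape)   # torch.Size([64]) -- one tensor per answer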
Moreover, even if I manually stack the list of tensors in a map call, the items remain a list of tensors:
val_dataset = val_dataset.map(lambda example: {k: torch.stack(v) for k, v in example.items()})
This looks very weird to me.
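For what it's worth, converting on access does give me a matrix; below is a sketch using with_transform, which (as far as I understand) runs on a small batch each time __getitem__ is called. I'd still prefer set_format('torch') to handle this natively.

def to_tensor(batch):
    # batch is a dict of lists; each element of batch['input_ids'] is a
    # nested 10 x 64 list, so torch.tensor() yields a (batch, 10, 64) tensor
    return {k: torch.tensor(v) for k, v in batch.items()}

val_dataset = val_dataset.with_transform(to_tensor)
print(val_dataset[0]['input_ids'].shape)  # torch.Size([10, 64]) on my side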