In my experiment, I concatenate one question with each of its 10 possible answers to generate one QA pair per answer, so that a language model can directly evaluate the perplexity of each answer. Therefore, each item in the dataset has an input_ids of shape 10 (n_answers) × 64 (sequence length).
However, when I use set_format('torch'), each item becomes a list of 10 tensors of shape (64,) rather than a single (10, 64) tensor.
Is there a way to make it return a matrix?
The code to reproduce the problem:
import torch
from datasets import load_dataset
from transformers import GPT2Tokenizer

def tokenize_func(examples, tokenizer):
    # Create a question-answer pair for every choice
    choices = examples['choices']
    duplicated_question = [examples['question']] * len(choices)
    tokenized = tokenizer(duplicated_question,
                          choices,
                          padding='max_length',  # batch size has to be 1
                          max_length=64,
                          truncation=True,
                          return_token_type_ids=True,
                          return_tensors='np')
    return tokenized

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

val_dataset = load_dataset('drt/kqa_pro', 'train_val', split='validation[:1%]')
val_dataset = val_dataset.map(tokenize_func, fn_kwargs={'tokenizer': tokenizer})
val_dataset = val_dataset.remove_columns(['question', 'sparql', 'program', 'choices', 'answer'])
val_dataset.set_format('torch')
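For reference, inspecting the first item after set_format shows the list-of-tensors behaviour (output from my run):

item = val_dataset[0]
print(type(item['input_ids']))      # <class 'list'>, not a torch.Tensor
print(item['input_ids'][0].shape)   # torch.Size([64]) -- one tensor per answer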
Moreover, even if I manually stack the list of tensors in a map call, the items remain a list of tensors:
val_dataset = val_dataset.map(lambda example: {k: torch.stack(v) for k, v in example.items()})
This looks very weird to me.
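For what it's worth, converting on access does give me a matrix; below is a sketch using with_transform, which (as far as I understand) runs on a small batch each time __getitem__ is called. I'd still prefer set_format('torch') to handle this natively.

def to_tensor(batch):
    # batch is a dict of lists; each element of batch['input_ids'] is a
    # nested 10 x 64 list, so torch.tensor() yields a (batch, 10, 64) tensor
    return {k: torch.tensor(v) for k, v in batch.items()}

val_dataset = val_dataset.with_transform(to_tensor)
print(val_dataset[0]['input_ids'].shape)  # torch.Size([10, 64]) on my side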