Train_test_split with a dataset loaded from dict

kamneb · November 9, 2022, 2:01pm

Hello,

I would like to split my dataset into train and test sets. The dataset was originally created from a dict.

It looks like this:

import numpy as np  # needed for the np.random calls below
from datasets import Dataset

data = {"text": ["This is a sentence"]*100, "extra_data": np.random.randint(0, 10, size=(100, 5)), "labels": np.random.randint(0, 3, size=(100,))}
ds = Dataset.from_dict(data)

However, when I try to split it with:

train_test = ds.train_test_split(test_size=0.2)

I get this error message:

pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object

Thanks

kamneb · November 9, 2022, 3:27pm

I updated my pyarrow and datasets versions, and now it works.
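
For anyone who hits the same error: this thread does not record which versions were involved, so a reasonable first step is to check what is currently installed. Below is a minimal sketch that uses only the standard library. The helper name `installed_versions` is hypothetical (it is not part of datasets or pyarrow), and the package list is an assumption based on the libraries used in the snippet above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("datasets", "pyarrow", "numpy")):
    """Return a dict mapping each package name to its installed version.

    Packages that are not installed map to None. The helper name and the
    default package list are illustrative assumptions, not a library API.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print one line per package so the versions can be pasted into a bug report.
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If the reported versions are old, upgrading both packages (for example with `pip install --upgrade datasets pyarrow`) and rerunning the split is what resolved the problem here.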
