Split a DataFrame into training and validation splits

Is there a way to load data as a Pandas DataFrame and split it into a training and validation split?

I tried this and it didn’t work.

import datasets
split = (datasets.Split.TRAIN + datasets.Split.TEST).subsplit(datasets.percent[:20])

dataset = Dataset.from_pandas(df,split=split)

Hello Derrick :hugs:

When you import a dataset from pandas with Dataset.from_pandas, you get a single Dataset, not a DatasetDict with several splits. Loading named splits like the ones you are combining only works when the dataset has its own loading script that defines those splits. So I'd suggest splitting your pandas DataFrame first, then converting each part into its own Dataset and working with those.
Still, I'd like to ping @lhoestq to see if there's a better solution than mine.

Let me know if this works.

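Hi! You can try:

from datasets import Dataset

# Convert the DataFrame, then hold out 20% of the rows.
# The result is named "splits" so it doesn't shadow the datasets module.
splits = Dataset.from_pandas(df).train_test_split(test_size=0.2)

train_dataset = splits["train"]
val_dataset = splits["test"]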