Is there a way to load data from a Pandas DataFrame and split it into training and validation sets?
I tried this, but it didn’t work:
split = (datasets.Split.TRAIN + datasets.Split.TEST).subsplit(datasets.percent[:20])
dataset = Dataset.from_pandas(df,split=split)
So when you import a dataset from pandas with Dataset.from_pandas, you get a single Dataset, not a DatasetDict. When loading datasets with these splits, you need to make sure the dataset has its own script that loads those splits. So I’d suggest splitting your pandas DataFrame first, then converting each part into a separate Dataset and working on those.
Still, I’d like to ping @lhoestq if there’s a better solution than mine.
Let me know if this works.
Hi! You can try:
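from datasets import Dataset
# `splits` rather than `datasets`, so the name does not shadow the library module
splits = Dataset.from_pandas(df).train_test_split(test_size=0.2)
train_dataset = splits["train"]
val_dataset = splits["test"]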