From Pandas DataFrame to Hugging Face Dataset

Hello everyone,

I am writing a tutorial on how to fine-tune a pretrained sentiment analysis classifier, and the whole fine-tuning part is based on a Hugging Face Dataset. Is there a way to convert a pandas DataFrame to a Hugging Face Dataset? It would help me a lot with my data preprocessing…

You can have a look here: link

Thanks for your help! Now it works :)

Is there a way to load this DataFrame into the train split and another in-memory DataFrame into the validation split?

None of the following options seem to do the trick:

dataset = Dataset.from_pandas(df)
dataset = Dataset.from_pandas(df, split='train')
dataset = Dataset.from_pandas(df, split=NamedSplit('train'))
dataset = Dataset.from_pandas(df, split=datasets.Split.TRAIN)
print(dataset)

The best I could come up with that worked was the following (not sure if there is an easier or more correct way):

import pandas as pd
import datasets
from datasets import Dataset, DatasetDict


tdf = pd.DataFrame({"a": [1, 2, 3], "b": ['hello', 'ola', 'thammi']})
vdf = pd.DataFrame({"a": [4, 5, 6], "b": ['four', 'five', 'six']})
tds = Dataset.from_pandas(tdf)
vds = Dataset.from_pandas(vdf)


ds = DatasetDict()

ds['train'] = tds
ds['validation'] = vds

print(ds)

Hi @akomma ! Yes, your second approach is the correct one.