From Pandas DataFrame to Hugging Face Dataset

Hello everyone,

I am working through a tutorial on how to fine-tune a pretrained sentiment analysis classifier, and the whole fine-tuning part is based on a Hugging Face Dataset. Is there a way to convert a pandas DataFrame into a Hugging Face Dataset? It would help me a lot with my data preprocessing…


You can have a look here: link


Thanks for your help! Now it works 🙂

Is there a way to load this DataFrame into the train split and another in-memory DataFrame into the validation split?

None of the following options seem to do the trick:

dataset = Dataset.from_pandas(df)
dataset = Dataset.from_pandas(df, split='train')
dataset = Dataset.from_pandas(df, split=NamedSplit('train'))
dataset = Dataset.from_pandas(df, split=datasets.Split.TRAIN)
print(dataset)

The best I could come up with that worked was the following (not sure if there is an easier or more correct way):

import pandas as pd
import datasets
from datasets import Dataset, DatasetDict


tdf = pd.DataFrame({"a": [1, 2, 3], "b": ['hello', 'ola', 'thammi']})
vdf = pd.DataFrame({"a": [4, 5, 6], "b": ['four', 'five', 'six']})
tds = Dataset.from_pandas(tdf)
vds = Dataset.from_pandas(vdf)


ds = DatasetDict()

ds['train'] = tds
ds['validation'] = vds

print(ds)

Hi @akomma ! Yes, your second approach is the correct one.


That page no longer exists. Could you please post the link again, or the name of the page?

Look for the from_pandas method at link

This link should work.

Thank you, that is helpful.

Hi @mariosasko! What did you mean by the second approach?

dataset = Dataset.from_pandas(df, split='train')

still doesn’t work in 2024. Do I still have to create it with something like this:

dataset = DatasetDict({"train": tds, "val": vds})

?
