Converting an HF dataset to pandas

Is there a way to convert a dataset downloaded with load_dataset to a pandas DataFrame?

Hi,

We have a method for that: Dataset.to_pandas. Note that by default it loads the entire dataset into memory to build the DataFrame. If your dataset is too big to fit in RAM, load it in chunks as follows:

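from datasets import load_dataset
dset = load_dataset(...)  # select a split so that dset is a Dataset, not a DatasetDict
# With batched=True, to_pandas yields DataFrames of up to batch_size rows each.
for df in dset.to_pandas(batch_size=..., batched=True):
    ...  # process each DataFrame here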

Another option is to use the pandas formatter, which will return a DataFrame object each time the dataset is indexed/sliced:

dset = load_dataset(...)
dset.set_format("pandas")
dset[10]  # returns a dataframe with 1 row
dset[10:30]  # returns a dataframe with 20 rows

Just a small add-on: if the Hugging Face dataset has train and test splits (so load_dataset gives you a DatasetDict), you need to select a split before indexing after set_format, for example:
dataset['train'][10]
Otherwise, it raises an Invalid key error.

I got the error below when trying the method above:
AttributeError: 'DatasetDict' object has no attribute 'to_pandas'