Wondering if there is a way to convert a dataset downloaded using load_dataset to pandas?
Hi,
we have a method for that: Dataset.to_pandas. However, note that by default this loads the entire dataset into memory to create a DataFrame. If your dataset is too big to fit in RAM, load it in chunks as follows:
from datasets import load_dataset

dset = load_dataset(...)
for df in dset.to_pandas(batch_size=..., batched=True):
    ...  # process each DataFrame chunk here
Another option is to use the pandas formatter, which will return a DataFrame object each time the dataset is indexed/sliced:
dset = load_dataset(...)
dset.set_format("pandas")
dset[10] # returns a dataframe with 1 row
dset[10:30] # returns a dataframe with 20 rows
Just a little add-on: if the Hugging Face dataset has train and test splits, then after set_format you need to select a split before indexing, for example:
dataset['train'][10]
Otherwise, you get an Invalid key error.
I got the error below when trying the method above:
AttributeError: 'DatasetDict' object has no attribute 'to_pandas'