Is there a way to convert a dataset downloaded using load_dataset to a pandas DataFrame?
Hi,
we have a method for that: Dataset.to_pandas. However, note that by default this loads the entire dataset into memory to create a DataFrame. If your dataset is too big to fit in RAM, load it in chunks as follows:
from datasets import load_dataset

dset = load_dataset(...)
for df in dset.to_pandas(batch_size=..., batched=True):
    ...  # process each DataFrame chunk
Another option is to use the pandas formatter, which returns a DataFrame object each time the dataset is indexed or sliced:
dset = load_dataset(...)
dset.set_format("pandas")
dset[10] # returns a dataframe with 1 row
dset[10:30] # returns a dataframe with 20 rows