Is there a way to convert a dataset downloaded with load_dataset to a pandas DataFrame?
Hi,
we have a method for that - Dataset.to_pandas. However, note that this will load the entire dataset into memory by default to create a DataFrame. If your dataset is too big to fit in RAM, load it in chunks as follows:
from datasets import load_dataset
dset = load_dataset(...)
for df in dset.to_pandas(batch_size=..., batched=True):
    # process each DataFrame chunk here
    ...
Another option is to use the pandas formatter, which will return a DataFrame object each time the dataset is indexed/sliced:
dset = load_dataset(...)
dset.set_format("pandas")
dset[10]  # returns a dataframe with 1 row
dset[10:30]  # returns a dataframe with 20 rows
Just a small add-on: if the Hugging Face dataset has both train and test splits, you need to select a split before indexing after calling set_format, for example:
dataset['train'][10]
Otherwise, indexing raises an invalid key error.
I got the error below when trying the above method:
AttributeError: 'DatasetDict' object has no attribute 'to_pandas'