Dataset to pandas dataframe and back to dataset

nitempe · February 16, 2022, 8:02pm

I have the code below. I convert a dataset to a dataframe and then back to a dataset, once with shuffled data and once with unshuffled data. When I compare the round-tripped data with the original, I get False for the shuffled data but True for the unshuffled data. Why is there this discrepancy?

import pandas as pd
import datasets

# Full IMDB train and test splits.
train_data, test_data = datasets.load_dataset('imdb', split=['train', 'test'],
                                              cache_dir='/media/data_files/github/website_tutorials/data')

# First 500 rows of each split, unshuffled.
train_data_s1, test_data_s1 = datasets.load_dataset('imdb', split=['train[0:500]', 'test[0:500]'],
                                                    cache_dir='/media/data_files/github/website_tutorials/data')

print(type(train_data_s1))
print(type(test_data_s1))

# Shuffle, then take the first 500 rows from train_data and test_data.
# Indices 0 through 499.
l1 = list(range(500))
print(l1)

train_data_s1_shuffled = train_data.shuffle(seed=2).select(l1)
test_data_s1_shuffled = test_data.shuffle(seed=3).select(l1)

print(type(train_data_s1_shuffled))
print(type(test_data_s1_shuffled))

Why do I get False when I compare the data below, but True in the block after it?

from datasets import ClassLabel, Dataset

# Label casting follows https://discuss.huggingface.co/t/how-to-create-custom-classlabels/13650
x1 = pd.DataFrame(train_data_s1_shuffled)
x2 = Dataset.from_pandas(x1).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos']))

print(x2.data == train_data_s1_shuffled.data)  # returns False
print(x2.features == train_data_s1_shuffled.features)

# Not sure why the data matches in this case but not in the earlier one.
x11 = pd.DataFrame(train_data)
x21 = Dataset.from_pandas(x11).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos']))

print(x21.data == train_data.data)
print(x21.features == train_data.features)

lhoestq · February 17, 2022, 5:27pm

Hi! When you call ds.data you get the full underlying table of data, the same one that existed before you called ds.select(). That's because ds.select() doesn't change the underlying table itself; it adds a mapping of indices on top of it (to map between ds[idx] and the right row in ds.data). ds.shuffle() works the same way.

However, you can update the underlying table so that it contains only the rows you passed to ds.select(), if you want. To do so, use ds = ds.flatten_indices(). Note that this can be expensive for big datasets, since it creates a whole new table of data.

nitempe · February 17, 2022, 5:29pm

Could you show your suggestion in code? Thanks!

lhoestq · February 17, 2022, 5:46pm

You can do this:

print(x2.data == train_data_s1_shuffled.flatten_indices().data)

nitempe · February 17, 2022, 5:57pm

Instead of doing it there, is there a way I could change how I define train_data_s1_shuffled?

Maybe something like this?

train_data_s1_shuffled = train_data.shuffle(seed=2).select(l1).flatten_indices()

lhoestq · February 23, 2022, 2:17pm

Yes, that works.
