Dataset to pandas dataframe and back to dataset

I have the code below. I convert a dataset to a pandas DataFrame and then back to a dataset, once with shuffled data and once with unshuffled data. When I compare the data in the shuffled case I get False, but in the unshuffled case I get True. Why is there this discrepancy?

import os

import numpy as np
import pandas as pd

import datasets

train_data, test_data = datasets.load_dataset(
    'imdb', split=['train', 'test'],
    cache_dir='/media/data_files/github/website_tutorials/data')

train_data_s1, test_data_s1 = datasets.load_dataset(
    'imdb', split=['train[0:500]', 'test[0:500]'],
    cache_dir='/media/data_files/github/website_tutorials/data')

print(type(train_data_s1))
print(type(test_data_s1))

# Shuffle, then take the first 500 rows from train_data and test_data

# Build the list of indices 0..499 (range's stop value is exclusive)
l1 = [*range(0, 500, 1)]

# Print the list
print(l1)

train_data_s1_shuffled = train_data.shuffle(seed=2).select(l1)
test_data_s1_shuffled = test_data.shuffle(seed=3).select(l1)

print(type(train_data_s1_shuffled))
print(type(test_data_s1_shuffled))

Why do I get False below when I compare the data, but True in the next block?

import pandas as pd
from datasets import ClassLabel, Dataset

#dataset_from_pandas = Dataset.from_pandas(df_pandas)
#https://discuss.huggingface.co/t/how-to-create-custom-classlabels/13650# "basic_sentiment holds values [-1,0,1]

x1 = pd.DataFrame(train_data_s1_shuffled)
x2 = Dataset.from_pandas(x1).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))

print(x2.data == train_data_s1_shuffled.data)  # returns False
print(x2.features == train_data_s1_shuffled.features)

# Not sure why the data matches in this case but not in the earlier one

x11 = pd.DataFrame(train_data)
#print (x11.head())
x21 = Dataset.from_pandas(x11).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))

print(x21.data == train_data.data)  # returns True
print(x21.features == train_data.features)

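Hi! When you call ds.data you get the full underlying table of data, the same one the dataset had before ds.select() was called. That's because ds.select() doesn't change the underlying table itself; it adds a mapping of indices on top of it (to map ds[idx] to the right row in ds.data). In your example, x2.data holds only the selected rows, while train_data_s1_shuffled.data still holds every row of train_data, so the comparison returns False. train_data has no such mapping, which is why the second comparison returns True.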
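To make the idea of an indices mapping concrete, here is a toy model in plain Python. ToyDataset and its attributes are hypothetical names made up for illustration; this is only a sketch of the concept, not how the datasets library is implemented.

```python
class ToyDataset:
    """Toy stand-in for a dataset: a table of rows plus an optional indices mapping."""

    def __init__(self, data, indices=None):
        self.data = data        # underlying table; select() never modifies it
        self.indices = indices  # None means "rows in table order"

    def __len__(self):
        return len(self.data) if self.indices is None else len(self.indices)

    def __getitem__(self, idx):
        # Translate the logical position into a row of the underlying table.
        row = idx if self.indices is None else self.indices[idx]
        return self.data[row]

    def select(self, positions):
        # Compose the new positions with any existing mapping; the table is shared.
        current = range(len(self.data)) if self.indices is None else self.indices
        return ToyDataset(self.data, [current[p] for p in positions])


full = ToyDataset(["a", "b", "c", "d"])
subset = full.select([3, 1])

print(len(subset))               # 2
print(subset[0])                 # d
print(subset.data is full.data)  # True: the subset still sees the whole table
```

In this sketch, comparing subset.data with a freshly built two-row table would fail for the same reason as in the question: the mapping changes what indexing returns, not what the table contains.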

However, you can update the underlying table so that it contains only the rows you passed to ds.select(). To do so, use ds = ds.flatten_indices(). Note that this can be expensive for big datasets, since it creates a whole new table of data.


Could you show your suggestion in code? Thanks!

You can do this:

print(x2.data == train_data_s1_shuffled.flatten_indices().data)

Instead of doing it there, is there a way I could change how I create train_data_s1_shuffled? Maybe something like this?

train_data_s1_shuffled = train_data.shuffle(seed=2).select(l1).flatten_indices()

Yes, that works indeed :)
