I have the code below. I convert a dataset to a pandas DataFrame and then back to a dataset. I do this twice: once with shuffled data and once with unshuffled data. When I compare `.data` for the shuffled dataset, I get False, but when I compare `.data` for the unshuffled dataset, I get True. Why is there this discrepancy?
import pandas as pd
import datasets
import numpy as np
import os
train_data, test_data = datasets.load_dataset('imdb', split =['train', 'test'],
cache_dir='/media/data_files/github/website_tutorials/data')
train_data_s1, test_data_s1 = datasets.load_dataset('imdb', split =['train[0:500]', 'test[0:500]'],
cache_dir='/media/data_files/github/website_tutorials/data')
print (type (train_data_s1))
print (type (test_data_s1))
# Shuffle, then take the first 500 rows from train_data and test_data
# Build the list of indices 0..499 (range(0, 499) would give only 499 indices)
l1 = list(range(500))
# Print the list
print(l1)
train_data_s1_shuffled=train_data.shuffle(seed=2).select(l1)
test_data_s1_shuffled=test_data.shuffle(seed=3).select(l1)
print (type (train_data_s1_shuffled))
print (type (test_data_s1_shuffled))
Why do I get False below when I compare the data, but True in the next block?
from datasets import Dataset
import pandas as pd
#dataset_from_pandas = Dataset.from_pandas(df_pandas)
#https://discuss.huggingface.co/t/how-to-create-custom-classlabels/13650# "basic_sentiment holds values [-1,0,1]
from datasets import ClassLabel
x1=pd.DataFrame(train_data_s1_shuffled)
x2=Dataset.from_pandas(x1).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))
print (x2.data==train_data_s1_shuffled.data)  # returns False
print (x2.features==train_data_s1_shuffled.features)
# Not sure why the data matches in this case but not in the earlier case
x11=pd.DataFrame(train_data)
#print (x11.head())
x21=Dataset.from_pandas(x11).cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))
print (x21.data==train_data.data)
print (x21.features==train_data.features)
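For comparison, a row-by-row check that does not rely on `.data` could look like the sketch below. It may help separate differences in the actual row content from differences in how the rows are stored. This is only a minimal sketch: `rows_equal` is a hypothetical helper name, the sample rows are made up, and it assumes both objects have a length and yield one dict per row when iterated (which is how a `datasets.Dataset` behaves).

```python
def rows_equal(a, b):
    """Return True if a and b hold the same rows in the same order.

    Assumes a and b support len() and yield one dict per row when
    iterated (assumption: a datasets.Dataset behaves this way).
    """
    if len(a) != len(b):
        return False
    return all(row_a == row_b for row_a, row_b in zip(a, b))


# Tiny stand-in rows, made up for illustration (not real IMDB data)
original = [
    {"text": "good film", "label": 1},
    {"text": "dull film", "label": 0},
]
round_tripped = [dict(row) for row in original]  # same rows, new objects

print(rows_equal(original, round_tripped))        # True
print(rows_equal(original, round_tripped[::-1]))  # False: order differs
```

With the objects from my code, the call would be `rows_equal(x2, train_data_s1_shuffled)`.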