I have a dataset containing around 1.7M entries with the following structure:
- ColA → one tokenized text
- ColB → a list of 3 tokenized text samples
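For concreteness, a single row presumably looks something like this (token ids invented purely for illustration):

row = {
    "ColA": [101, 7592, 2088, 102],   # one tokenized text
    "ColB": [                         # three tokenized text samples
        [101, 2307, 102],
        [101, 2204, 102],
        [101, 2919, 102],
    ],
}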
I compared different methods of iterating over the dataset with a torch DataLoader and found the HF datasets versions to be consistently slower.
All versions use the same collate_fn.
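The real collate_fn is omitted here; as a minimal stand-in, assume something like the following, which just regroups the per-sample dicts the DataLoader hands over into per-column lists (keys assumed lowercase, as in version C):

def collate_fn(batch):
    # batch is a list of per-sample dicts from the DataLoader;
    # this placeholder just gathers them into per-column lists
    return {
        "cola": [sample["cola"] for sample in batch],
        "colb": [sample["colb"] for sample in batch],
    }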
import time

import pandas as pd
from datasets import load_from_disk
from torch.utils.data import DataLoader, Dataset

# mydata = 100k samples

# A -> ~81 sec (HF dataset, fully loaded into memory)
dataset = load_from_disk("mydata", keep_in_memory=True)
loader = DataLoader(dataset, batch_size=1000, collate_fn=collate_fn)

# B -> ~83 sec (HF dataset, memory-mapped from disk)
dataset = load_from_disk("mydata", keep_in_memory=False)
loader = DataLoader(dataset, batch_size=1000, collate_fn=collate_fn)
# C -> ~45 sec (plain pandas-backed Dataset)
data = pd.read_parquet("mydata")

class MyDataset(Dataset):
    def __init__(self, pd_dataset):
        self.data = pd_dataset

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return {"cola": self.data.cola[idx], "colb": self.data.colb[idx]}

m = MyDataset(data)
loader = DataLoader(m, batch_size=1000, collate_fn=collate_fn)
# Testing: time one full pass over the loader
start_time = time.time()
for batch in loader:
    pass
print("--- %s seconds ---" % (time.time() - start_time))
I am aware that these timings are rough, but the differences should still be indicative.
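If tighter numbers were needed, a more careful harness (a sketch of what I could do, not what produced the timings above) would average several full passes to smooth out warm-up and OS caching effects:

def time_loader(loader, n_runs=3):
    # Average several full passes to reduce warm-up / caching noise
    times = []
    for _ in range(n_runs):
        start = time.time()
        for _ in loader:
            pass
        times.append(time.time() - start)
    return sum(times) / len(times)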
Questions:
- Why is my pandas version significantly faster than iterating over the Hugging Face dataset?
- Why is there no significant performance difference between keep_in_memory=True and keep_in_memory=False?