Slow iteration speed (with and without keep_in_memory=True)

I have a dataset containing around 1.7 entries and the following structure:
ColA → tokenized text
ColB → List(3 tokenized text samples)

I was comparing different methods of iterating over the dataset using a torch DataLoader and found the HF datasets to be consistently slower.

All versions use the same collate_fn
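
The collate_fn itself is not reproduced here, so the following is only a minimal sketch of a collate_fn along these lines, assuming it pads each batch to its longest sequence (the dynamic padding described later in the thread); the column names and padding value are placeholders.

import torch

def collate_fn(batch):
    # batch is a list of dicts, e.g. {"cola": [token ids], "colb": [three lists of token ids]}
    cola = [torch.as_tensor(sample["cola"]) for sample in batch]
    # pad to the longest sequence in this batch (padding id assumed to be 0)
    cola_padded = torch.nn.utils.rnn.pad_sequence(cola, batch_first=True, padding_value=0)
    # "colb" would be handled the same way for each of its three sub-sequences
    return {"cola": cola_padded}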

# mydata = 100k samples
# A -> ~81 sec
from datasets import load_from_disk
from torch.utils.data import DataLoader

dataset = load_from_disk("mydata", keep_in_memory=True)
loader = DataLoader(dataset, batch_size=1000, collate_fn=collate_fn)

# B -> ~83 sec
dataset = load_from_disk("mydata", keep_in_memory=False)
loader = DataLoader(dataset, batch_size=1000, collate_fn=collate_fn)

# C -> ~45 sec
import pandas as pd
from torch.utils.data import Dataset, DataLoader

data = pd.read_parquet("mydata")

class MyDataset(Dataset):
    def __init__(self, pd_dataset):
        self.data = pd_dataset

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return {"cola": self.data.cola[idx], "colb": self.data.colb[idx]}

m = MyDataset(data)
loader = DataLoader(m, batch_size=1000, collate_fn=collate_fn)

# Testing
import time

start_time = time.time()
for batch in loader:
    pass
print("--- %s seconds ---" % (time.time() - start_time))

I am aware these timings are not rigorous benchmarks, but the differences should give some indication.

Questions:

  • Why is my pandas version significantly faster than iterating over the Hugging Face dataset?
  • Why is there no significant performance difference between keep_in_memory=True and keep_in_memory=False?

When reading tokenized text from Arrow, the bottleneck is often the conversion of the tokenized Arrow data to Python lists.

It's much faster to load them as torch tensors directly, since the data is then loaded zero-copy from disk:

dataset = load_from_disk("mydata", keep_in_memory=False)
dataset = dataset.with_format("torch")
loader = DataLoader(dataset, batch_size=1000, collate_fn=collate_fn)

Also make sure to use recent versions of datasets and torch (at least datasets 2.10 and torch 1.13) to get the best speed.
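
For example, to check the installed versions:

import datasets
import torch

print(datasets.__version__)  # should be >= 2.10
print(torch.__version__)     # should be >= 1.13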


Thank you for the reply!

I should add that the data in my dataset is tokenized but not truncated, as I want to perform dynamic padding on the batches. Thus I cannot convert the dataset into torch tensors of the same size. I found that in this case, with_format("torch") reduces the speed.

Any suggestions on how to still improve performance?

I found that in this case, with_format("torch") reduces the speed.

Oh, this shouldn't be the case. Let me know if you find what's slowing things down for you (using a profiler, or by stopping the iteration in the middle and checking the traceback).
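
For example, a minimal way to profile the iteration loop with cProfile (nothing datasets-specific assumed):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
for batch in loader:
    pass
profiler.disable()

# show the 20 functions with the highest cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)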