Slow iteration speed (with and without keep_in_memory=True)

I have a dataset containing around 1.7 entries and the following structure:
ColA → tokenized text
ColB → List(3 tokenized text samples)

I was comparing different methods of iterating over the dataset using a torch DataLoader and found the HF datasets to be consistently slower.

All versions use the same collate_fn
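
The collate_fn itself is not reproduced here, so the following is only a minimal sketch of a collate_fn along these lines, assuming it pads each batch to its longest sequence (the dynamic padding described later in the thread); the column names and padding value are placeholders.

import torch

def collate_fn(batch):
    # batch is a list of dicts, e.g. {"cola": [token ids], "colb": [three lists of token ids]}
    cola = [torch.as_tensor(sample["cola"]) for sample in batch]
    # pad to the longest sequence in this batch (padding id assumed to be 0)
    cola_padded = torch.nn.utils.rnn.pad_sequence(cola, batch_first=True, padding_value=0)
    # "colb" would be handled the same way for each of its three sub-sequences
    return {"cola": cola_padded}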

# mydata = 100k samples
# A -> ~81 sec
from datasets import load_from_disk
from torch.utils.data import DataLoader

dataset = load_from_disk("mydata", keep_in_memory=True)
loader = DataLoader(dataset, batch_size=1000, collate_fn=collate_fn)

# B -> ~83 sec
dataset = load_from_disk("mydata", keep_in_memory=False)
loader = DataLoader(dataset, batch_size=1000, collate_fn=collate_fn)

# C -> ~45 sec
import pandas as pd
from torch.utils.data import Dataset, DataLoader

data = pd.read_parquet("mydata")

class MyDataset(Dataset):
    def __init__(self, pd_dataset):
        self.data = pd_dataset

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return {"cola": self.data.cola[idx], "colb": self.data.colb[idx]}

m = MyDataset(data)
loader = DataLoader(m, batch_size=1000, collate_fn=collate_fn)

# Testing
import time

start_time = time.time()
for batch in loader:
    pass
print("--- %s seconds ---" % (time.time() - start_time))

I am aware these timings are not rigorous benchmarks, but the differences should give some indication.

Questions:

  • Why is my pandas version significantly faster than iterating over the Hugging Face dataset?
  • Why is there no significant performance difference between keep_in_memory=True and keep_in_memory=False?

When reading tokenized text from Arrow, the bottleneck is often the conversion of the tokenized Arrow data to Python lists.

It's much faster to load them as torch tensors directly, since the data is then loaded zero-copy from disk:

dataset = load_from_disk("mydata", keep_in_memory=False)
dataset = dataset.with_format("torch")
loader = DataLoader(dataset, batch_size=1000, collate_fn=collate_fn)

Also make sure to use recent versions of datasets and torch (at least datasets 2.10 and torch 1.13) to get the best speed.
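
For example, to check the installed versions:

import datasets
import torch

print(datasets.__version__)  # should be >= 2.10
print(torch.__version__)     # should be >= 1.13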


Thank you for the reply!

I should add that the data in my dataset is tokenized but not truncated, as I want to perform dynamic padding on the batches. Thus I cannot convert the dataset into torch tensors of the same size. I found that in this case, with_format("torch") reduces the speed.

Any suggestions on how to still improve performance?

I found that in this case, with_format("torch") reduces the speed.

Oh, this shouldn't be the case. Let me know if you find what's slowing things down for you (using a profiler, or by stopping the iteration in the middle and checking the traceback).
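
For example, a minimal way to profile the iteration loop with cProfile (nothing datasets-specific assumed):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
for batch in loader:
    pass
profiler.disable()

# show the 20 functions with the highest cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)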