Dataset.from_generator() costs much more time in VS Code debugging mode than in running mode

bossalex · September 22, 2023, 1:37pm

Hey there,
I’m using Dataset.from_generator() to convert a torch dataset to a Hugging Face Dataset.
However, when I debug my code in VS Code, Dataset.from_generator() runs really slowly, sometimes as much as 10 times longer than running the script from the terminal.
Here is a simple test I tried:

import time
from typing import Iterator, Optional

from torch.utils.data import Dataset as TorchDataset

from datasets import Dataset as HFDataset


class SimpleDataset(TorchDataset):
    """Map-style torch dataset wrapping a list of dicts."""

    def __init__(self, data):
        self.data = data
        self.keys = list(data[0].keys())

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sample = self.data[index]
        return {key: sample[key] for key in self.keys}


def TorchDataset2HuggingfaceDataset(
    torch_dataset: TorchDataset, cache_dir: Optional[str] = None
) -> HFDataset:
    """Convert a torch dataset to a Hugging Face dataset."""

    def generator() -> Iterator[dict]:
        # Yield the samples one by one; from_generator() calls this function.
        yield from torch_dataset

    return HFDataset.from_generator(generator, cache_dir=cache_dir)

if __name__ == '__main__':
    data = [  
        {'id': 1, 'name': 'Alice'},  
        {'id': 2, 'name': 'Bob'},  
        {'id': 3, 'name': 'Charlie'}  
    ]
    
    torch_dataset = SimpleDataset(data)
    start_time = time.time() 
    huggingface_dataset = TorchDataset2HuggingfaceDataset(torch_dataset)
    end_time = time.time()
    print("time: ", end_time - start_time)
    print(huggingface_dataset)

On my machine, this test takes 0.086 s when run from the terminal,
but 0.25 s in debugging mode in VS Code, which I think is much longer than expected.
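A single time.time() measurement of a sub-second call is noisy, so a gap like this is easier to judge from repeated runs. The following is a minimal sketch, assuming a hypothetical helper named time_call (it is not part of datasets or torch) that keeps the best of several time.perf_counter() measurements:

```python
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[[], Any], repeats: int = 5) -> Tuple[Any, float]:
    """Hypothetical helper: call fn() `repeats` times.

    Returns the last result and the best (smallest) elapsed time in seconds.
    """
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return result, best


if __name__ == "__main__":
    # Stand-in workload used only for illustration.
    result, seconds = time_call(lambda: sum(range(100_000)))
    print(f"best of 5: {seconds:.6f} s")
```

One caveat if this is applied to from_generator(): the result is cached, so repeated calls may measure a cache hit rather than a fresh build.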

I’d like to know whether there is anything wrong in the code, or whether this is just because of debugging.
I traced the code and found that this is the function where it gets stuck.

In datasets.builder.BuilderConfig:

def create_config_id(
        self,
        config_kwargs: dict,
        custom_features: Optional[Features] = None,
    ) -> str:
    ...
    # stuck in this line
    suffix = Hasher.hash(config_kwargs_to_add_to_suffix)

Thanks for your help.

lhoestq · September 23, 2023, 12:26pm

The hashing part simply dumps the object using pickle and hashes the resulting bytes.
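The "dump, then hash" idea can be sketched roughly as follows. This is an illustrative stand-in, not the actual Hasher from datasets: the function name hash_object, the use of the standard-library pickle, and the choice of SHA-256 are all assumptions made for the example, and the real implementation may use a different pickler and hash function.

```python
import hashlib
import pickle
from typing import Any


def hash_object(obj: Any) -> str:
    """Illustrative stand-in: serialize obj with pickle, then hash the bytes."""
    payload = pickle.dumps(obj)
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    # Hypothetical config values, chosen only for the example.
    config_kwargs = {"cache_dir": None, "split": "train"}
    print(hash_object(config_kwargs))
    # An equal copy serializes to the same bytes here, so the digest matches.
    print(hash_object(config_kwargs) == hash_object(dict(config_kwargs)))
```

The hashing itself is cheap; the serialization step is where most of the Python-level work happens, especially when the object being dumped is a function with a closure.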

Not sure how the vscode debugging could make this slower though…

bossalex · September 23, 2023, 1:41pm

Thanks, a little weird.

mariosasko · October 3, 2023, 2:46pm

According to

running a debugger in Python < 3.12 “can have a severe impact on performance. Slowdowns by an order of magnitude are common.”

So I think this explains the issue.
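The effect can be reproduced without a debugger. The sketch below assumes the debugger relies on sys.settrace-style tracing, which is an assumption about the tool rather than something confirmed in this thread; busy_loop and run_traced are hypothetical names. It runs a pure-Python loop once normally and once under a tracing hook that does nothing except count events.

```python
import sys
import time
from typing import Callable, Tuple


def busy_loop(n: int = 200_000) -> int:
    """Pure-Python workload that generates many line events."""
    total = 0
    for i in range(n):
        total += i
    return total


def run_traced(fn: Callable[[], int]) -> Tuple[int, float, int]:
    """Run fn under a do-nothing trace hook.

    Returns (result, elapsed seconds, number of trace events seen).
    """
    events = 0

    def tracer(frame, event, arg):
        nonlocal events
        events += 1
        return tracer  # keep receiving events for this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        start = time.perf_counter()
        result = fn()
        elapsed = time.perf_counter() - start
    finally:
        sys.settrace(previous)  # restore whatever hook was installed before
    return result, elapsed, events


if __name__ == "__main__":
    start = time.perf_counter()
    plain_result = busy_loop()
    plain = time.perf_counter() - start

    traced_result, traced, events = run_traced(busy_loop)
    assert plain_result == traced_result
    print(f"plain:  {plain:.4f} s")
    print(f"traced: {traced:.4f} s ({events} trace events)")
```

Even though the hook does no useful work, it is called for every executed line, which is why Python-heavy steps such as serializing a generator function can slow down noticeably under a tracing debugger.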

bossalex · October 10, 2023, 5:04am

I really appreciate your assistance with my question. Thank you!
