Dataset.from_generator() takes much more time in VS Code debugging mode than in normal run mode

Hey there,
I’m using Dataset.from_generator() to convert a torch Dataset into a Hugging Face Dataset.
However, when I debug my code in VS Code, Dataset.from_generator() runs really slowly; it can take up to 10 times longer than running the script in the terminal.
Here is a simple test I tried:

import time
from typing import Callable, Iterator, Optional

from torch.utils.data import Dataset as TorchDataset

from datasets import Dataset as HFDataset


class SimpleDataset(TorchDataset):
    def __init__(self, data):
        self.data = data
        self.keys = list(data[0].keys())

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sample = self.data[index]
        return {key: sample[key] for key in self.keys}


def TorchDataset2HuggingfaceDataset(
    torch_dataset: TorchDataset, cache_dir: Optional[str] = None
) -> HFDataset:
    """
        convert torch dataset to huggingface dataset
    """
    # The callable returns an iterator over samples, not a dataset.
    generator: Callable[[], Iterator[dict]] = lambda: (sample for sample in torch_dataset)

    return HFDataset.from_generator(generator, cache_dir=cache_dir)

if __name__ == '__main__':
    data = [  
        {'id': 1, 'name': 'Alice'},  
        {'id': 2, 'name': 'Bob'},  
        {'id': 3, 'name': 'Charlie'}  
    ]
    
    torch_dataset = SimpleDataset(data)
    start_time = time.time() 
    huggingface_dataset = TorchDataset2HuggingfaceDataset(torch_dataset)
    end_time = time.time()
    print("time: ", end_time - start_time)
    print(huggingface_dataset)

On my machine, this test reports a running time of 0.086 s in the terminal,
but 0.25 s in debugging mode in VS Code, which I think is much longer than expected.

I’d like to know whether there is anything wrong in the code or whether this is just because of debugging.
I traced the code and found that this is the function where it gets stuck.

In datasets.builder.BuilderConfig:

def create_config_id(
    self,
    config_kwargs: dict,
    custom_features: Optional[Features] = None,
) -> str:
    ...
    # stuck on this line
    suffix = Hasher.hash(config_kwargs_to_add_to_suffix)

Thanks for your help.

The hashing part simply dumps the object using pickle and hashes the resulting bytes.

Not sure how the vscode debugging could make this slower though…
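
For anyone who wants to picture the "dump, then hash" step, here is a rough sketch using only the standard library. It is not the code inside datasets: the name sketch_hash, the example config_kwargs, and the choice of pickle plus SHA-256 are all assumptions for illustration, and the library's real serializer and hash function may differ.

```python
import hashlib
import pickle


def sketch_hash(obj) -> str:
    """Serialize obj to bytes with pickle, then hash those bytes."""
    payload = pickle.dumps(obj)
    return hashlib.sha256(payload).hexdigest()


# Hypothetical stand-in for the kwargs that end up in the config id suffix.
config_kwargs = {"cache_dir": None, "split": "train"}

digest = sketch_hash(config_kwargs)
print(digest[:16])  # short prefix, similar in spirit to a config id suffix
```

The same input gives the same digest within a run, which is what makes this kind of hash usable as a cache key.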

Thanks, a little weird. :cry:

According to

running a debugger in Python < 3.12 “can have a severe impact on performance. Slowdowns by an order of magnitude are common.”

So I think this explains the issue.
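
To get a feel for that overhead without attaching a real debugger, a small experiment with sys.settrace can help. This is only a sketch: busy_work, helper, timed, and tracer are made-up names, and a trivial trace function is a stand-in for what a debugger does, so the real slowdown in VS Code will differ.

```python
import sys
import time


def helper(i: int) -> int:
    return i * 2


def busy_work(n: int) -> int:
    # Many small Python-level calls, which is where tracing hurts most.
    total = 0
    for i in range(n):
        total += helper(i)
    return total


def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


events = {"count": 0}


def tracer(frame, event, arg):
    # Called for every call/line/return event in traced frames.
    events["count"] += 1
    return tracer


plain_result, plain_seconds = timed(busy_work, 50_000)

previous = sys.gettrace()
sys.settrace(tracer)
try:
    traced_result, traced_seconds = timed(busy_work, 50_000)
finally:
    sys.settrace(previous)  # always restore the earlier trace function

print(f"untraced: {plain_seconds:.4f}s")
print(f"traced:   {traced_seconds:.4f}s ({events['count']} trace events)")
```

Both runs compute the same result; only the elapsed time changes, and the exact ratio depends on the machine and Python version.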

I really appreciate your assistance with my question. Thank you!