Hey there,
I'm using Dataset.from_generator() to convert a PyTorch dataset to a Hugging Face Dataset.
However, when I debug my code in VS Code, Dataset.from_generator() runs really slowly; it can take up to 10 times longer than running the script from the terminal.
Here is a simple test I tried:
import time
from typing import Callable, Iterator, Optional

from torch.utils.data import Dataset as TorchDataset
from datasets import Dataset as HFDataset


class SimpleDataset(TorchDataset):
    def __init__(self, data):
        self.data = data
        self.keys = list(data[0].keys())

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sample = self.data[index]
        return {key: sample[key] for key in self.keys}


def TorchDataset2HuggingfaceDataset(
    torch_dataset: TorchDataset, cache_dir: Optional[str] = None
) -> HFDataset:
    """
    Convert a torch dataset to a Hugging Face dataset.
    """
    # from_generator expects a callable that returns an iterator of samples
    generator: Callable[[], Iterator[dict]] = lambda: (sample for sample in torch_dataset)
    return HFDataset.from_generator(generator, cache_dir=cache_dir)


if __name__ == '__main__':
    data = [
        {'id': 1, 'name': 'Alice'},
        {'id': 2, 'name': 'Bob'},
        {'id': 3, 'name': 'Charlie'},
    ]
    torch_dataset = SimpleDataset(data)
    # Time only the conversion step
    start_time = time.time()
    huggingface_dataset = TorchDataset2HuggingfaceDataset(torch_dataset)
    end_time = time.time()
    print("time: ", end_time - start_time)
    print(huggingface_dataset)
On my machine, this test takes 0.086 s when run from the terminal,
but 0.25 s in debugging mode in VS Code, which I think is much longer than expected.
I'd like to know whether there is anything wrong in the code, or whether this is just because of debugging.
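To get a rough feel for how much a debugger-style hook alone can cost, here is a minimal sketch using only the standard library. It installs a do-nothing trace function with sys.settrace, which is my assumption about how the debugger hooks in (newer Python versions and debuggers may use a different mechanism), and times a call-heavy pure-Python loop with and without it. The names busy, noop_tracer, timed and compare are made up for this illustration and have nothing to do with datasets.

```python
import sys
import time


def busy(n):
    """Pure-Python workload that makes n small function calls."""
    def add(a, b):
        return a + b

    total = 0
    for i in range(n):
        total = add(total, i)
    return total


def noop_tracer(frame, event, arg):
    # Returning the tracer keeps line-level tracing active in each new frame,
    # which is roughly what a debugger needs in order to honor breakpoints.
    return noop_tracer


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def compare(n=200_000):
    """Run the same workload without and with a trace hook installed."""
    plain_result, plain_time = timed(busy, n)
    previous = sys.gettrace()
    sys.settrace(noop_tracer)
    try:
        traced_result, traced_time = timed(busy, n)
    finally:
        # Always restore whatever hook was installed before
        sys.settrace(previous)
    return plain_result, traced_result, plain_time, traced_time


if __name__ == "__main__":
    _, _, plain_time, traced_time = compare()
    print("plain: ", plain_time)
    print("traced:", traced_time)
```

If the traced run is several times slower here as well, that would suggest the slowdown in my test is mostly debugger overhead rather than something wrong in the conversion code.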
I traced the code and found that this is the function where it gets stuck.
In datasets.builder.BuilderConfig:
def create_config_id(
    self,
    config_kwargs: dict,
    custom_features: Optional[Features] = None,
) -> str:
    ...
    # stuck in this line
    suffix = Hasher.hash(config_kwargs_to_add_to_suffix)
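To time the hashing step on its own, outside from_generator(), something like the sketch below might work. time_call is a hypothetical helper I wrote for this; Hasher is imported from datasets.fingerprint, and I am assuming that the generator lambda is the expensive part of the kwargs being hashed, which I have not confirmed.

```python
import time


def time_call(fn, *args, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` calls to fn(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    try:
        from datasets.fingerprint import Hasher
    except ImportError:
        Hasher = None  # datasets is not installed; nothing to measure

    if Hasher is not None:
        data = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
        # Stand-in for the generator passed to from_generator() (assumption)
        generator = lambda: (sample for sample in data)
        print("Hasher.hash time:", time_call(Hasher.hash, generator))
```

Running this once from the terminal and once under the debugger should show whether Hasher.hash alone accounts for the difference.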
Thanks for your help.