Convert dataset to pytorch dataloader

I’m trying to convert a Huggingface dataset into a pytorch dataloader.
I’m trying to do it in streaming mode to avoid downloading a huge amount of data. I have the following so far:

from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("speech_commands", "v0.02", streaming=True)
all_columns = dataset["train"].column_names
columns_to_remove = set(all_columns) - set(['audio', 'label'])
trainset = dataset["train"].remove_columns(columns_to_remove)
trainset = trainset.map(lambda e: {'audio': e['audio']['array'], 'label': e['label']})
trainset = trainset.with_format(type='torch')
dataloader = DataLoader(trainset, batch_size=4)

for batch in dataloader:
    audioTensor  = batch['audio']
    targetTensor = batch['label']

Is this the best way to do it? It seems clunky.

Also, for whatever reason, it’s not caching the .map() transform, even though the hash for the lambda is the same on every session. What’s going on there?

Hi! Here is a slightly simpler version:

from datasets import load_dataset
from torch.utils.data import DataLoader

dataset = load_dataset("speech_commands", "v0.02", streaming=True)
trainset = dataset["train"].select_columns(['audio', 'label'])
trainset = trainset.map(lambda e: {'audio': e['audio']['array'], 'label': e['label']})
dataloader = DataLoader(trainset, batch_size=4)

for batch in dataloader:
    audioTensor  = batch['audio']
    targetTensor = batch['label']

PS: IterableDataset.with_format is currently a no-op, but it will soon perform the conversion to torch tensors directly, in an optimized way.
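
In the meantime, if you need padded torch tensors per batch right away, one option is a custom collate_fn. A minimal sketch, assuming the clips can vary in length and so need padding before stacking (the collate function itself is just for illustration):

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate(batch):
    # batch is a list of dicts yielded by the streaming dataset;
    # pad variable-length clips to the longest clip in the batch
    audio = pad_sequence(
        [torch.as_tensor(e['audio'], dtype=torch.float32) for e in batch],
        batch_first=True,
    )
    labels = torch.tensor([e['label'] for e in batch])
    return {'audio': audio, 'label': labels}

dataloader = DataLoader(trainset, batch_size=4, collate_fn=collate)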

Caching only happens with datasets on disk. Nothing is written to disk or cached for datasets in streaming mode, so the .map() transform is re-applied on the fly every time you iterate, regardless of the lambda's hash.
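
If you do want .map() to be cached across sessions, drop streaming=True and work with the on-disk dataset instead. A sketch, assuming you have the disk space (this downloads the full dataset first):

from datasets import load_dataset

# on-disk dataset: downloaded once, then reused from the local cache
dataset = load_dataset("speech_commands", "v0.02", split="train")
trainset = dataset.select_columns(['audio', 'label'])
# for on-disk datasets, .map() writes its result to the cache and reuses it
# on the next run as long as the function's fingerprint is unchanged
trainset = trainset.map(lambda e: {'audio': e['audio']['array'], 'label': e['label']})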