Fine-tune a model from a script-based dataset

Hi, I’m trying to follow this tutorial to fine-tune Whisper.
The only difference in my case is that I’m using a dataset that I created with a script, as detailed here.

Now when I load my dataset and try to print it as follows:

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("user/dataset-name", split="train", use_auth_token=True, streaming=True)

common_voice["test"] = load_dataset("user/dataset-name", split="test", use_auth_token=True, streaming=True)

print(common_voice)

It returns something like this:

DatasetDict({
    train: <datasets.iterable_dataset.IterableDataset object at 0x7f151f995760>
    test: <datasets.iterable_dataset.IterableDataset object at 0x7f151f9a3d60>
})

Instead, I’d like to get something like what the tutorial shows:

DatasetDict({
    train: Dataset({
        features: [list_of_features],
        num_rows: 6540
    })
    test: Dataset({
        features: [list_of_features],
        num_rows: 2894
    })
})

What am I missing?

Hi!

That’s because with streaming=True, load_dataset returns an IterableDataset, which is a different class from a regular Dataset and is read lazily rather than loaded up front. To get a regular Dataset object, just remove that argument from load_dataset:

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()
common_voice["train"] = load_dataset("user/dataset-name", split="train", use_auth_token=True)

I still want to stream the data rather than download it. I don’t have enough space.
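
If streaming has to stay on, one possible way to see what a split contains is to pull only the first few examples from it instead of printing the object. The sketch below is only an illustration: peek and column_names are hypothetical helpers (not part of the datasets library), and fake_stream is a stand-in generator with made-up fields so the snippet runs without any download. It assumes the streamed split can be iterated and yields one dict per example.

```python
from itertools import islice


def peek(stream, n=2):
    """Return the first n examples of any iterable, reading no more than n items."""
    return list(islice(stream, n))


def column_names(example):
    """Return the sorted keys of a single example dict."""
    return sorted(example.keys())


def fake_stream():
    # Stand-in for a streamed split; the field names here are made up.
    for i in range(5):
        yield {"audio": f"clip_{i}.wav", "sentence": f"text {i}"}


first = peek(fake_stream(), n=2)
print(column_names(first[0]))  # column names of the first example
print(len(first))              # number of examples actually read
```

With the real data, peek(common_voice["train"]) should behave the same way under that assumption. Only the examples that are actually read get fetched, which is also why a streamed split cannot show a row count up front the way the tutorial output does.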