Fine tune a model from a script-based dataset

asennoussi · January 27, 2023, 3:16am

Hi, I’m trying to follow this tutorial to fine-tune Whisper.
The difference in my case is that I’m using a dataset that I created with a loading script, as detailed here.

Now when I load my dataset and try to print it as follows:

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()

common_voice["train"] = load_dataset("user/dataset-name", split="train", use_auth_token=True, streaming=True)

common_voice["test"] = load_dataset("user/dataset-name", split="test", use_auth_token=True, streaming=True)

print(common_voice)

It returns something like this:

DatasetDict({
    train: <datasets.iterable_dataset.IterableDataset object at 0x7f151f995760>
    test: <datasets.iterable_dataset.IterableDataset object at 0x7f151f9a3d60>
})

But I’d like to get something like what the tutorial shows:

DatasetDict({
    train: Dataset({
        features: [list_of_features],
        num_rows: 6540
    })
    test: Dataset({
        features: [list_of_features],
        num_rows: 2894
    })
})

What am I missing?

stevhliu · January 27, 2023, 5:17pm

Hi!

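That’s because with streaming=True, load_dataset returns an IterableDataset, which is a different type from a regular Dataset: examples are read lazily as you iterate, so it doesn’t print features and num_rows the way a regular Dataset does. To get a regular Dataset object, just remove that argument from load_dataset: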

from datasets import load_dataset, DatasetDict

common_voice = DatasetDict()
common_voice["train"] = load_dataset("user/dataset-name", split="train", use_auth_token=True)
common_voice["test"] = load_dataset("user/dataset-name", split="test", use_auth_token=True)
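
The difference between the two types can be sketched in plain Python. This is only an analogy, not the datasets implementation: the generator stream_examples and its fields are hypothetical stand-ins for a streamed split. A loaded split behaves like a list with a known length, while a streamed split behaves like a generator that has no length but can still be inspected a few examples at a time.

```python
from itertools import islice


def stream_examples():
    """Hypothetical stand-in for a streamed split: yields examples lazily."""
    for i in range(1_000_000):
        yield {"id": i, "sentence": f"example {i}"}


# A fully loaded split is like a list: its size is known up front.
loaded = [{"id": i, "sentence": f"example {i}"} for i in range(3)]
num_rows = len(loaded)

# A streamed split is like a generator: it has no len(), so there is
# no row count to report without reading everything.
streamed = stream_examples()
has_len = hasattr(streamed, "__len__")

# It can still be inspected by pulling only the first few examples.
first_two = list(islice(streamed, 2))
column_names = list(first_two[0].keys())

print(num_rows, has_len, column_names)
```

With a streamed datasets split, the corresponding calls would be next(iter(common_voice["train"])) or common_voice["train"].take(2), which keep streaming=True and only read the examples that are actually requested.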

asennoussi · January 28, 2023, 10:25am

I still want to stream the data rather than download it; I don’t have enough disk space.
