[SOLVED] How to import a custom dataset (wav2vec2 & Common Voice)?

I’m fairly new here, and I’m trying to follow the blog post by @patrickvonplaten on wav2vec2 and the Turkish Common Voice dataset. I got stuck at the very start.

I use custom splits I created locally, so downloading from HF datasets is not an option. As you might guess, I hit a wall with the Audio field. I checked datasets.Features and related posts but could not get it working.

Any direction is much appreciated.

Bülent Özden

Edit: this is the relevant part of the code I’m currently trying:

import datasets
from datasets import load_dataset

# Using a local Common Voice dataset (can be custom splits)
datasetdir = DATASETPATH + LANGCODE + "/"
SEP = "\t"
train_split = f"{datasetdir}train.tsv"
dev_split = f"{datasetdir}dev.tsv"
test_split = f"{datasetdir}test.tsv"
validated = f"{datasetdir}validated.tsv"
#splits = {"train": train_split, "validation": dev_split, "test": test_split}
train_splits = {"train": train_split, "validation": dev_split}
test_splits = {"test": test_split}

# Specify features for adding the Audio field
features = datasets.Features({
    "client_id": datasets.Value(dtype='string', id=None),
    "path": datasets.Value(dtype='string', id=None),
    "sentence": datasets.Value(dtype='string', id=None),
    "up_votes": datasets.Value(dtype='int64', id=None),
    "down_votes": datasets.Value(dtype='int64', id=None),
    "age": datasets.Value(dtype='string', id=None),
    "gender": datasets.Value(dtype='string', id=None),
    "accent": datasets.Value(dtype='string', id=None),
    "locale": datasets.Value(dtype='string', id=None),
    "segment": datasets.Value(dtype='string', id=None),
    "audio": datasets.Audio(sampling_rate=48000, mono=True, decode=True, id=None)
})


train = load_dataset("csv", sep=SEP, data_files=train_splits, features=features)
test = load_dataset("csv", sep=SEP, data_files=test_splits, features=features)

I’m getting the following error:

...
d:\Anaconda\anaconda3\envs\hf-pytorch\lib\site-packages\pyarrow\types.pxi in pyarrow.lib.DataType.to_pandas_dtype()

NotImplementedError: struct<bytes: binary, path: string>

I’m on Windows x64 with Anaconda:
torch 1.11.0+cu113
transformers 4.18.0
datasets 2.1.0
tokenizers 0.12.1

Hi! Casting to the Audio feature type is not currently supported in our packaged loaders; it’s on our to-do list. You can work around this by loading the column as a plain string first and then casting it to the Audio type afterwards:

features = datasets.Features({
    "client_id": datasets.Value(dtype='string', id=None),
    "path": datasets.Value(dtype='string', id=None),
    "sentence": datasets.Value(dtype='string', id=None),
    "up_votes": datasets.Value(dtype='int64', id=None),
    "down_votes": datasets.Value(dtype='int64', id=None),
    "age": datasets.Value(dtype='string', id=None),
    "gender": datasets.Value(dtype='string', id=None),
    "accent": datasets.Value(dtype='string', id=None),
    "locale": datasets.Value(dtype='string', id=None),
    "segment": datasets.Value(dtype='string', id=None),
    "audio": datasets.Value("string"), 
})


train = load_dataset("csv", sep=SEP, data_files=train_splits, features=features)
test = load_dataset("csv", sep=SEP, data_files=test_splits, features=features)

train = train.cast_column("audio", datasets.Audio(sampling_rate=48000, mono=True, decode=True, id=None))
test = test.cast_column("audio", datasets.Audio(sampling_rate=48000, mono=True, decode=True, id=None))

Thank you @mariosasko, I was pulling my hair out (assuming they exist) :slight_smile:

I certainly can live with your excellent solution :tada:

Hi, I came across this same issue just recently (using datasets==2.14.0) but this fix no longer seems to work… I see the following error on the audio field:

TypeError: Couldn't cast array of type
struct<bytes: binary, path: string>
to
string

Any chance there’s a newer solution available?

(My situation is actually slightly different as I saved the whole Common Voice DatasetDict with save_to_disk(...), not as a CSV… But seems like it’s the same general issue?)

Hey @Thewz, actually that workaround is no longer necessary: the missing cast mentioned above has since made it off the to-do list and into the implementation.

You should be able to cast it now… For example, I have this in my Whisper-related code:

from datasets import Audio

ds = ds.cast_column("audio", Audio(sampling_rate=SAMPLING_RATE))

Ahh sorry, you’re quite right @bozden - I got confused because my local and remote environments were on different versions. Upgrading datasets lets me call load_dataset directly, with no feature type edits required :slight_smile:
