I’m fairly new here and am trying to follow the blog post by @patrickvonplaten on wav2vec2 and the Turkish Common Voice dataset, but I got stuck at the very start.
I use custom splits that I created locally, so downloading from HF datasets is not an option. As you might guess, I hit a wall with the Audio field. I checked datasets.Features and related posts but could not get it to work.
Any direction is much appreciated.
Bülent Özden
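Edit: This is part of the code from my latest attempt:
import datasets
from datasets import load_dataset

# Using local Common Voice dataset (can be custom splits)
# DATASETPATH and LANGCODE are defined earlier in the script
datasetdir = DATASETPATH + LANGCODE + "/"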
SEP = "\t"
train_split = f"{datasetdir}train.tsv"
dev_split = f"{datasetdir}dev.tsv"
test_split = f"{datasetdir}test.tsv"
validated = f"{datasetdir}validated.tsv"
#splits = {"train": train_split, "validation": dev_split, "test": test_split}
train_splits = {"train": train_split, "validation": dev_split}
test_splits = {"test": test_split}
# Specify features for adding the Audio field
features = datasets.Features({
    "client_id": datasets.Value(dtype='string', id=None),
    "path": datasets.Value(dtype='string', id=None),
    "sentence": datasets.Value(dtype='string', id=None),
    "up_votes": datasets.Value(dtype='int64', id=None),
    "down_votes": datasets.Value(dtype='int64', id=None),
    "age": datasets.Value(dtype='string', id=None),
    "gender": datasets.Value(dtype='string', id=None),
    "accent": datasets.Value(dtype='string', id=None),
    "locale": datasets.Value(dtype='string', id=None),
    "segment": datasets.Value(dtype='string', id=None),
    "audio": datasets.Audio(sampling_rate=48000, mono=True, decode=True, id=None)
})
train = load_dataset("csv", sep=SEP, data_files=train_splits, features=features)
test = load_dataset("csv", sep=SEP, data_files=test_splits, features=features)
I’m getting the following error:
...
d:\Anaconda\anaconda3\envs\hf-pytorch\lib\site-packages\pyarrow\types.pxi in pyarrow.lib.DataType.to_pandas_dtype()
NotImplementedError: struct<bytes: binary, path: string>
I’m on Windows x64 with Anaconda and the following package versions:
torch 1.11.0+cu113
transformers 4.18.0
datasets 2.1.0
tokenizers 0.12.1
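
One direction that might work, as a rough sketch rather than a confirmed fix: the error appears to come from passing the Audio feature to the CSV loader, which tries to map every feature to a pandas dtype and has none for the Audio struct type. The sketch below instead loads the TSV with inferred column types, builds an `audio` column of full file paths from `path`, and then casts that column to Audio. The helper names `clip_path` and `load_local_split`, the `clips_dir` argument, and the 16 kHz default are assumptions for illustration and are not part of the code above.

```python
import os


def clip_path(clips_dir, filename):
    """Join the clips directory with the file name from the TSV `path` column."""
    return os.path.join(clips_dir, filename)


def load_local_split(tsv_path, clips_dir, sampling_rate=16_000):
    """Load one local TSV split and attach an `audio` column (hypothetical helper)."""
    # Imported lazily so clip_path can be used without `datasets` installed.
    from datasets import Audio, load_dataset

    # Let the CSV loader infer plain column types; no Audio feature at this stage.
    ds = load_dataset("csv", sep="\t", data_files={"data": tsv_path})["data"]

    # Add an `audio` column that holds the full file path as a string.
    ds = ds.map(lambda row: {"audio": clip_path(clips_dir, row["path"])})

    # Cast the string column to Audio; files are only decoded when accessed.
    return ds.cast_column("audio", Audio(sampling_rate=sampling_rate))
```

Usage would look like `train = load_local_split(train_split, datasetdir + "clips")`, assuming the audio files sit in a `clips` folder next to the TSV files. The 16 kHz default is chosen because wav2vec2 checkpoints expect 16 kHz input; pass a different `sampling_rate` if that is not what you need.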