How to combine local data files with an official 🤗 dataset

Hey,

I made a short notebook to show how local data files can be loaded into 🤗 Datasets and
then combined with an official dataset into one Dataset object.

Check out the Google Colab here.

So, if I have a very different dataset, but I have the sentence and the mp3 for each sample (regardless of audio quality, or whether most of the file is silence or background noise), do I only need to create that JSON file and use it as a base dataset?

Also, it seems you can save the JSON files with the correct column names from the start, in which case the rename after loading is not needed.
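
To make that concrete, here is a minimal sketch, assuming a JSON Lines manifest with one record per line. The column names `path` and `sentence`, the clip paths, and the helper names `write_manifest` and `load_manifest` are assumptions for illustration; use whatever columns the dataset you want to combine with actually has.

```python
import json
from pathlib import Path


def write_manifest(rows, out_path):
    """Write one JSON object per line, using the final column names directly."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return out_path


def load_manifest(manifest_path):
    """Load the manifest as a Dataset (needs `pip install datasets`)."""
    # Imported here so that write_manifest can be used without the library.
    from datasets import load_dataset

    return load_dataset("json", data_files=str(manifest_path), split="train")


# Hypothetical rows: the columns already carry the wanted names,
# so no rename should be needed after loading.
rows = [
    {"path": "clips/sample_0001.mp3", "sentence": "Hello there."},
    {"path": "clips/sample_0002.mp3", "sentence": "How are you?"},
]
```

Calling `write_manifest(rows, "train.json")` followed by `load_manifest("train.json")` should give a Dataset whose columns are already named `path` and `sentence`.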

Found how to load from pandas… but now I get this error while concatenating:

ValueError: Datasets should ALL come from memory, or should ALL come from disk.
However datasets [1] come from memory and datasets [0] come from disk.

😦

I had the same issue before, so I just saved the dataset to disk and reloaded it, like this:

import datasets

dataset.save_to_disk("train_dataset")
dataset = datasets.load_from_disk("train_dataset")

Then I can concatenate the two datasets.
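
For reference, the save-and-reload workaround can be sketched end to end as follows. This is only a sketch, assuming both datasets have the same columns; the helper names `reload_from_disk` and `concatenate_after_reload` and the directory argument are made up for illustration.

```python
def reload_from_disk(dataset, directory):
    """Save a dataset and load it back, so that it is backed by files on disk."""
    from datasets import load_from_disk  # needs `pip install datasets`

    dataset.save_to_disk(directory)
    return load_from_disk(directory)


def concatenate_after_reload(disk_dataset, memory_dataset, directory):
    """Reload the in-memory dataset from disk, then concatenate the two."""
    from datasets import concatenate_datasets

    reloaded = reload_from_disk(memory_dataset, directory)
    return concatenate_datasets([disk_dataset, reloaded])
```

For example, with `local = Dataset.from_pandas(df)`, calling `concatenate_after_reload(common_voice_train, local, "local_dataset")` would round-trip `local` through disk before concatenating.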

I’m doing something like this:

Move to memory:

common_voice_train = common_voice_train.map(lambda x: x, keep_in_memory=True)

Move to disk:

common_voice_train = common_voice_train.map(lambda x: x, keep_in_memory=False)

Hey, can anyone help me with this?

I think the cause is that my JSON file contains a list, and my JSON data is inside that list. Does anyone know how I can read it with load_dataset?

Try first creating a list of the file paths, and only then pass it to load_dataset.
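
A minimal sketch of that suggestion, assuming the JSON files sit in one local directory: collect the paths into a plain Python list first, then hand the list to `load_dataset` through `data_files`. The helper names `list_json_files` and `load_json_files` are hypothetical.

```python
from pathlib import Path


def list_json_files(data_dir):
    """Return the .json files in data_dir as a sorted list of path strings."""
    return sorted(str(p) for p in Path(data_dir).glob("*.json"))


def load_json_files(data_dir):
    """Load every .json file in data_dir as one Dataset (needs `pip install datasets`)."""
    from datasets import load_dataset

    files = list_json_files(data_dir)
    if not files:
        raise FileNotFoundError(f"no .json files found in {data_dir}")
    return load_dataset("json", data_files=files, split="train")
```

Printing the result of `list_json_files` before loading is a cheap way to confirm that only the intended files are picked up.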


I have already created the list of file paths and I am passing it. These are the contents of my JSON file, so why is it giving an error?


Now it’s giving me this error, but why?

@danurahul I think this issue may happen if your JSON couldn’t be read properly by Arrow (ArrowInvalid error). Can you try with a single file to begin with, so that you can debug properly?

We recently pushed a feature that allows concatenating any datasets without getting this error!
Currently this feature is only available on master, but we’ll do a new release soon.

With one single JSON file it works fine, but as I add more files it gives an error.

Sure, I will check that out.

One of your files must have format issues or different fields than the other JSON files.
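
One way to find the offending file is to check each file on its own before loading them all together. The sketch below uses only the standard library and assumes JSON Lines files (one object per line); `check_json_files` is a hypothetical helper, not part of 🤗 Datasets.

```python
import json


def check_json_files(paths):
    """Return {path: problem} for files that fail to parse or whose
    fields differ from those of the first valid record seen."""
    problems = {}
    expected = None  # field names of the first valid record
    for path in paths:
        try:
            with open(path, encoding="utf-8") as f:
                for line_no, line in enumerate(f, start=1):
                    if not line.strip():
                        continue  # ignore blank lines
                    record = json.loads(line)
                    if not isinstance(record, dict):
                        problems[path] = f"line {line_no}: expected a JSON object"
                        break
                    fields = set(record)
                    if expected is None:
                        expected = fields
                    elif fields != expected:
                        problems[path] = (
                            f"line {line_no}: fields {sorted(fields)} "
                            f"differ from {sorted(expected)}"
                        )
                        break
        except json.JSONDecodeError as err:
            problems[path] = f"invalid JSON: {err}"
    return problems
```

Files that do not appear in the returned dictionary parsed cleanly and share the same fields, so they should be safe to pass to load_dataset together.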