Loading list as dataset

I understand that I would typically save a list of data (e.g., list = [{"text": "...", "label": 1}, {"text": "...", "label": 1}, ...]) as a JSON file and load it with load_dataset("json", data_files="/path/data.json"). However, in this use case the list is generated on the fly, so I would like to load it directly as a dataset. I tried load_dataset(list) but got the following error:

TypeError: expected str, bytes or os.PathLike object, not list

Is there a way to load a list directly as a dataset?

Hi! You can use Dataset.from_list(list) to create a Dataset from a Python list.

PS: Dataset.from_list was added to datasets in version 2.5.0, so if your installation is older than that, update it with pip install -U datasets.

Worked perfectly. Thanks!

Is it possible to create a similar dataset from a list of strings?

Currently I do
dataset = Dataset.from_list(math_sentences) #math_sentences is array of string

but I got

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[16], line 3
      1 from datasets import Dataset
----> 3 dataset = Dataset.from_list(math_sentences)  #math_sentences is array of string

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\datasets\arrow_dataset.py:950, in Dataset.from_list(cls, mapping, features, info, split)
    934 """
    935 Convert a list of dicts to a `pyarrow.Table` to create a [`Dataset`]`.
    936 
   (...)
    947     [`Dataset`]
    948 """
    949 # for simplicity and consistency wrt OptimizedTypedSequence we do not use InMemoryTable.from_pylist here
--> 950 mapping = {k: [r.get(k) for r in mapping] for k in mapping[0]} if mapping else {}
    951 return cls.from_dict(mapping, features, info, split)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\datasets\arrow_dataset.py:950, in <dictcomp>(.0)
    934 """
    935 Convert a list of dicts to a `pyarrow.Table` to create a [`Dataset`]`.
    936 
   (...)
    947     [`Dataset`]
    948 """
    949 # for simplicity and consistency wrt OptimizedTypedSequence we do not use InMemoryTable.from_pylist here
--> 950 mapping = {k: [r.get(k) for r in mapping] for k in mapping[0]} if mapping else {}
    951 return cls.from_dict(mapping, features, info, split)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\datasets\arrow_dataset.py:950, in <listcomp>(.0)
    934 """
    935 Convert a list of dicts to a `pyarrow.Table` to create a [`Dataset`]`.
    936 
   (...)
    947     [`Dataset`]
    948 """
    949 # for simplicity and consistency wrt OptimizedTypedSequence we do not use InMemoryTable.from_pylist here
--> 950 mapping = {k: [r.get(k) for r in mapping] for k in mapping[0]} if mapping else {}
    951 return cls.from_dict(mapping, features, info, split)

AttributeError: 'str' object has no attribute 'get'

@tempdeltavalue I had the same issue loading a list of strings. The fix was to convert it into a list of dicts, each with "text" as the key and the actual string as the value.