KeyError: 'Field "builder_name" does not exist in table schema'

Hello!

I have a 3-level nested DatasetDict for the MultiDoGo datasets (paper splits, domain, train/dev/test) that I am trying to upload to the Hub as a community dataset.

When I test downloading it afterwards, I get:

KeyError: 'Field "builder_name" does not exist in table schema'

Seems like something is not right in the dataset_dict.json's fields... How can I solve this issue?

I am using datasets version 1.16.1.

Thank you


I encountered a similar issue recently. I observed that the schema was not exactly the same across all the files in the dataset, and because of this load_dataset() was failing. So my guess is that one of your files does not have the field 'builder_name'.
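
One way to check this guess is to list the top-level fields of every JSON file in the dataset directory and see which files lack the field. The sketch below uses only the standard library; the function names are made up for illustration, and it assumes each .json file holds a single JSON document (not JSON Lines).

```python
import json
from pathlib import Path


def top_level_keys_by_file(root):
    """Map each *.json file under `root` to the sorted list of its top-level keys."""
    root = Path(root)
    result = {}
    for path in sorted(root.rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            obj = json.load(f)
        # Only JSON objects have named fields; lists and scalars get an empty list.
        keys = sorted(obj) if isinstance(obj, dict) else []
        result[path.relative_to(root).as_posix()] = keys
    return result


def files_missing_field(root, field):
    """Return the relative paths of the JSON files that lack `field` at top level."""
    keys_by_file = top_level_keys_by_file(root)
    return [name for name, keys in keys_by_file.items() if field not in keys]
```

Calling files_missing_field(".", "builder_name") from the dataset directory would then list the files whose fields differ from the ones that do contain builder_name.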

Hi! Can you post the full stack trace of the error? This could help debug your issue.

Also note that support for multi-configuration datasets is still a work in progress (see documentation here), so load_dataset currently merges all your train sets together (and does the same for test and dev).

I've just run into the same error. Here's my stack trace:

>>> from datasets import load_dataset
>>> data = load_dataset('.')
Using custom data configuration .-418a6ac4a70df3d8
Downloading and preparing dataset json/. to /home/dave/.cache/huggingface/datasets/json/.-418a6ac4a70df3d8/0.0.0/c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde...
100%|██████████| 3/3 [00:00<00:00, 13414.62it/s]
100%|██████████| 3/3 [00:00<00:00, 1955.69it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dave/.local/lib/python3.8/site-packages/datasets/load.py", line 1694, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/dave/.local/lib/python3.8/site-packages/datasets/builder.py", line 595, in download_and_prepare
    self._download_and_prepare(
  File "/home/dave/.local/lib/python3.8/site-packages/datasets/builder.py", line 683, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/dave/.local/lib/python3.8/site-packages/datasets/builder.py", line 1138, in _prepare_split
    writer.write_table(table)
  File "/home/dave/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 473, in write_table
    pa_table = pa.Table.from_arrays([pa_table[name] for name in self._schema.names], schema=self._schema)
  File "/home/dave/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 473, in <listcomp>
    pa_table = pa.Table.from_arrays([pa_table[name] for name in self._schema.names], schema=self._schema)
  File "pyarrow/table.pxi", line 1339, in pyarrow.lib.Table.__getitem__
  File "pyarrow/table.pxi", line 1900, in pyarrow.lib.Table.column
  File "pyarrow/table.pxi", line 1875, in pyarrow.lib.Table._ensure_integer_index
KeyError: 'Field "builder_name" does not exist in table schema'

I encountered the same issue and solved it with "load_from_disk":

from datasets import load_from_disk

# data_dir is the directory that was written earlier with save_to_disk()
dataset = load_from_disk(data_dir)

Perfect! This finally allowed me to use .push_to_hub() successfully as well. :hugs: