TypeError: Couldn't cast to null

python 3.12.12

datasets 4.2.0

Hi Everyone!

I was hoping you could help me understand where this error might be coming from and how to resolve it. Here is a code sample:

from datasets import load_dataset
import json

# creating representative data
person1 = {
    "employee": {"name": "Janice", "age": 25, "departments": []},
    "divisions": ["DivA", "DivB"],
}
person2 = {
    "employee": {"name": "Jake", "age": 30, "departments": ["IT", "Planning"]},
    "divisions": "DivC",
}

# writing data to json file for testing
people = [person1, person2]
counter = 0
paths = []
for person in people:
    counter += 1
    path = "./person{}.json".format(counter)
    paths.append(path)
    with open(path, "w") as f:
        json.dump(person, f)

# create HFDataset from paths
hf_ds = load_dataset("json", data_files=paths)

And here is the traceback:

Traceback (most recent call last):
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/builder.py", line 1831, in _prepare_split_single
    writer.write_table(table)
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/arrow_writer.py", line 714, in write_table
    pa_table = table_cast(pa_table, self._schema)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/table.py", line 2272, in table_cast
    return cast_table_to_schema(table, schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/table.py", line 2224, in cast_table_to_schema
    cast_array_to_feature(
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/table.py", line 1795, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/table.py", line 2002, in cast_array_to_feature
    _c(array.field(name) if name in array_fields else null_array, subfeature)
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/table.py", line 2052, in cast_array_to_feature
    casted_array_values = _c(array.values, feature.feature)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/table.py", line 2086, in cast_array_to_feature
    return array_cast(
           ^^^^^^^^^^^
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/table.py", line 1948, in array_cast
    raise TypeError(f"Couldn't cast array of type {_short_str(array.type)} to {_short_str(pa_type)}")
TypeError: Couldn't cast array of type string to null

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/dirac/mscrw/tests/hf_ds_test.py", line 26, in <module>
    hf_ds = load_dataset("json", data_files=paths)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/load.py", line 1417, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/builder.py", line 894, in download_and_prepare
    self._download_and_prepare(
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/builder.py", line 970, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/builder.py", line 1702, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dirac/mscrw/.venv/lib/python3.12/site-packages/datasets/builder.py", line 1858, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

I have a hunch about what it could be, but I was hoping to get some feedback here. If you need any more information, just let me know. Thanks in advance for your help!


Hmm… an unsupported JSON structure, perhaps?


Root cause: your JSON mixes shapes, and the lists Arrow sees first are empty. PyArrow locks a list's element type as null when the first values it encounters are all empty, so a later list of strings cannot be cast to list<item: null>, and Datasets raises TypeError: Couldn't cast array of type string to null. In your sample, divisions is a list in one file but a string in the other, and employee.departments is [] first and then a list of strings. This matches known Datasets/PyArrow behavior: the schema is fixed after the first chunk that contains only null/empty values. (GitHub)
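
You can see the shape mismatch directly in PyArrow (a minimal sketch; pyarrow is already a dependency of datasets):

import pyarrow as pa

# the two files imply incompatible Arrow types for the same columns
print(pa.array([["DivA", "DivB"]]).type)  # list<item: string>  (person1: divisions)
print(pa.array(["DivC"]).type)            # string              (person2: divisions)
print(pa.array([[]]).type)                # list<item: null>    (person1: employee.departments)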

What is happening, step by step

  1. load_dataset("json", ...) streams records in Arrow batches and infers a schema from the early chunks. If a list column contains only [] or null at that stage, Arrow picks list<item: null>. (Apache Arrow)
  2. When a later chunk contains concrete strings, Arrow has to cast string to null, which is invalid. The cast fails and bubbles up as your traceback; the sketch below reproduces it in isolation. Community reports describe the same “schema locked after early nulls” failure mode. (GitHub)
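
Here is that failing cast in isolation (a minimal sketch; the exact exception class can vary across pyarrow versions, so both are caught):

import pyarrow as pa

locked = pa.array([[]]).type            # list<item: null>, inferred from an all-empty chunk
later = pa.array([["IT", "Planning"]])  # list<item: string>, arriving in a later chunk
try:
    later.cast(locked)
except (pa.ArrowInvalid, pa.ArrowNotImplementedError) as e:
    print(e)  # a "cast ... to null" style error, matching the traceback above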

Two clean fixes

A) Make the JSON structurally consistent before loading
Keep column shapes stable. Lists should be lists everywhere. Strings should be strings everywhere. Then declare the schema so Arrow doesn’t guess null for list element types.

# https://huggingface.co/docs/datasets/en/about_dataset_features
from datasets import load_dataset, Features, Value, Sequence
import json

def as_list(x):
    # normalize None and scalars into lists so every file has the same shape
    if x is None:
        return []
    return x if isinstance(x, list) else [x]

person1 = {
    "employee": {"name": "Janice", "age": 25, "departments": as_list([])},
    "divisions": as_list(["DivA", "DivB"]),
}
person2 = {
    "employee": {"name": "Jake", "age": 30, "departments": as_list(["IT", "Planning"])},
    "divisions": as_list("DivC"),  # becomes ["DivC"]
}

# write the JSON files as in your repro
paths = []
for i, person in enumerate([person1, person2], start=1):
    path = "./person{}.json".format(i)
    paths.append(path)
    with open(path, "w") as f:
        json.dump(person, f)

# declare the schema so Arrow never has to guess a list's element type
features = Features({
    "employee": {
        "name": Value("string"),
        "age": Value("int64"),
        "departments": Sequence(Value("string")),
    },
    "divisions": Sequence(Value("string")),
})

hf_ds = load_dataset("json", data_files=paths, features=features)
print(hf_ds["train"].features["divisions"])  # the declared string element type, not null
# docs: https://huggingface.co/docs/datasets/en/about_dataset_features

Why this works: you unify shapes and you tell Datasets the intended element type for lists (Sequence(Value("string"))), preventing Arrow from inferring null on empty lists. HF’s Features docs recommend declaring features when inference is ambiguous. (Hugging Face)

B) If you cannot change the files → normalize on the fly and pass a schema
Wrap scalars into lists and default missing fields to [] in a generator, then build the dataset with an explicit Features.

# refs:
# - features: https://huggingface.co/docs/datasets/en/about_dataset_features
# - similar cast issues: https://github.com/huggingface/datasets/issues/7222
from datasets import Dataset, Features, Value, Sequence
import json, pathlib

features = Features({
    "employee": {
        "name": Value("string"),
        "age": Value("int64"),
        "departments": Sequence(Value("string")),
    },
    "divisions": Sequence(Value("string")),
})

def to_list(x):
    # None -> [], scalar -> [scalar], list -> unchanged
    if x is None:
        return []
    return x if isinstance(x, list) else [x]

def gen(paths):
    for p in paths:
        obj = json.loads(pathlib.Path(p).read_text())
        # normalize divisions to list[str]
        obj["divisions"] = to_list(obj.get("divisions"))
        # normalize employee.departments to list[str], tolerating a missing employee
        emp = obj.setdefault("employee", {})
        emp["departments"] = to_list(emp.get("departments"))
        yield obj

paths = ["./person1.json", "./person2.json"]
ds = Dataset.from_generator(gen, gen_kwargs={"paths": paths}, features=features)

Why this works: you feed Arrow rows that already match the declared schema; the cast to a null element type never occurs. Similar workarounds are used in issues where mixed null/filled batches cause casts to fail. (GitHub)
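
Continuing from the generator example above, a quick sanity check (assuming the two person*.json files from the repro exist on disk):

print(ds.features["divisions"])                # a string element type, as declared
print(ds[0]["divisions"], ds[1]["divisions"])  # ['DivA', 'DivB'] ['DivC']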

Debugging checklist you can apply to any dataset

  • List vs scalar: never mix. Use ["DivC"], not "DivC".
  • Empty lists: safe only if the element type is declared. Otherwise Arrow may set list<item: null>. (Apache Arrow)
  • Declare features= for any nested or list columns. This avoids ambiguous inference. (Hugging Face)
  • If an error persists, load only the first file or first N rows to see what schema Arrow inferred (as sketched below), then enforce the intended schema with features= or cast. Community reports show the error surfaces only when the “problematic” values arrive after schema lock-in. (GitHub)
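
For example, a minimal way to inspect the inferred schema (a sketch assuming the repro's person1.json exists; probe is a hypothetical name):

from datasets import load_dataset

probe = load_dataset("json", data_files="./person1.json", split="train")
print(probe.features)     # feature types as datasets sees them
print(probe.data.schema)  # underlying Arrow schema; departments shows a null element type here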

Notes on versions and stability

  • Your datasets==4.2.0 is recent. Several casting bugs were still being filed in 2024–2025. Upgrading to the latest datasets and pyarrow may reduce edge cases, but declaring the schema is still the most reliable fix. See the Datasets release notes for ongoing fixes. (GitHub)
  • The issue is not Python-specific. It is Arrow schema inference interacting with streaming writes. Similar “string → null” or “int → null” cast failures appear across different datasets and operations. (GitHub)

Why this bites here (background)

  • Arrow uses a fixed schema per column. When it cannot infer an element type for a list because early values are empty, it uses null. That schema becomes authoritative for the writer. Later non-null elements must be cast to that element type; string → null is invalid. Declaring Sequence(Value("string")) gives Arrow the element type up front and avoids the trap. Arrow and HF issues document this null-inference behavior and downstream cast failures. (Apache Arrow)
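
In plain PyArrow terms, supplying the element type up front is what features= does for you (a minimal sketch):

import pyarrow as pa

# with an explicit type, an empty first value no longer forces a null element type
typed = pa.array([[], ["IT", "Planning"]], type=pa.list_(pa.string()))
print(typed.type)  # list<item: string>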

Practical “do this now”

  1. Change "DivC" to ["DivC"].
  2. Keep departments a list in every file.
  3. Pass the features shown above to load_dataset.

These three steps resolve your exact repro. Reports with the same error confirm this approach. (GitHub)

Short curated references

  • Root cause and fixes
    • GitHub: “Couldn’t cast … to null in long json” explains schema lock after early null chunks; proposes declaring features. (GitHub)
    • HF docs: “Dataset features” describes declaring Sequence(Value("string")) for list columns. (Hugging Face)
    • Arrow docs: when inference fails, type becomes null; avoid by passing a schema. (Apache Arrow)
  • Similar issues worth skimming
    • Mixed-type or late non-null batches causing cast errors. (GitHub)
    • Other string↔null and schema mis-read cases that look identical during load_dataset. (GitHub)
  • Version context
    • Datasets releases in 2025 show active changes; upgrading can help but schema declaration remains best practice. (GitHub)