To begin with, Python's language specification includes `None` but not `null`, so I think the `null` probably comes from pandas or pyarrow.
In pandas, for example, an empty value such as `""` can end up as null (this is the default when reading a CSV). I suspect that data like this causes problems when the datasets library performs an implicit type conversion. On the other hand, `int64` is probably data that has been successfully tokenized.
opened 02:16PM - 19 Jun 23 UTC
closed 03:13PM - 26 Jul 23 UTC
### Describe the bug
When mapping a dataset with complex types, `datasets` is sometimes unable to infer a valid schema for the output of the function passed to `Dataset.map()`. This often comes from conflicting types, such as empty lists and filled lists competing for the same field.
This is prone to happen in batch mapping, when the mapper returns a batch of null/empty values while other batches are non-null. A workaround is to manually cast the new batch to a pyarrow table (as implemented in this [workaround](https://github.com/piercefreeman/lassen/pull/3)), but ideally this should be solved at the core library level.
Note that the reproduction case only throws this error if the first datapoint has the empty list. If it is processed later, datasets already detects its representation as list-type and therefore allows the empty list to be provided.
### Steps to reproduce the bug
A trivial reproduction case:
```python
from typing import Any, Iterator

import pandas as pd
import pytest
from datasets import Dataset


def batch_to_examples(batch: dict[str, list[Any]]) -> Iterator[dict[str, Any]]:
    # Assumed definition: the snippet as posted uses `lengths` without defining it.
    lengths = {len(values) for values in batch.values()}
    if len(lengths) != 1:
        raise ValueError(f"Batch columns must share one length, got {lengths}")
    for i in range(next(iter(lengths))):
        yield {feature: values[i] for feature, values in batch.items()}


def examples_to_batch(examples) -> dict[str, list[Any]]:
    batch = {}
    for example in examples:
        for feature, value in example.items():
            if feature not in batch:
                batch[feature] = []
            batch[feature].append(value)
    return batch


def batch_process(examples, explicit_schema: bool = False):
    # explicit_schema is unused here; the default lets map() call this directly.
    new_examples = []
    for example in batch_to_examples(examples):
        new_examples.append(dict(texts=example["raw_text"].split()))
    return examples_to_batch(new_examples)


df = pd.DataFrame(
    [
        {"raw_text": ""},
        {"raw_text": "This is a test"},
        {"raw_text": "This is another test"},
    ]
)
dataset = Dataset.from_pandas(df)

# datasets can't infer the column type when the first batch holds only an empty list.
with pytest.raises(TypeError, match="Couldn't cast array of type"):
    dataset = dataset.map(
        batch_process,
        batched=True,
        batch_size=1,
        num_proc=1,
        remove_columns=dataset.column_names,
    )
```
This results in crashes like:
```bash
  File "/Users/piercefreeman/Library/Caches/pypoetry/virtualenvs/example-9kBqeSPy-py3.11/lib/python3.11/site-packages/datasets/table.py", line 1819, in wrapper
    return func(array, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/piercefreeman/Library/Caches/pypoetry/virtualenvs/example-9kBqeSPy-py3.11/lib/python3.11/site-packages/datasets/table.py", line 2109, in cast_array_to_feature
    return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/piercefreeman/Library/Caches/pypoetry/virtualenvs/example-9kBqeSPy-py3.11/lib/python3.11/site-packages/datasets/table.py", line 1819, in wrapper
    return func(array, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/piercefreeman/Library/Caches/pypoetry/virtualenvs/example-9kBqeSPy-py3.11/lib/python3.11/site-packages/datasets/table.py", line 1998, in array_cast
    raise TypeError(f"Couldn't cast array of type {array.type} to {pa_type}")
TypeError: Couldn't cast array of type string to null
```
### Expected behavior
The code should successfully map and create a new dataset without error.
### Environment info
Mac OSX, Linux