ArrowTypeError: Expected bytes, got a 'float' object, when trying to make a dataset from a list of dicts

mu1990 · October 15, 2023, 3:10am

I'm trying to create a dataset from a list of dictionaries like this:

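import numpy as np
from datasets import Dataset

# Object array of dicts, so allow_pickle=True is required
sp_train = np.load('data/SP-train.npy', allow_pickle=True)

# Turn the array into a plain Python list of dicts
sp_train_list = [sp_train.item(i) for i in range(sp_train.size)]

sp_train_dataset = Dataset.from_list(sp_train_list)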

But I'm getting the following error:
ArrowTypeError: Expected bytes, got a 'float' object

Thank you!

lhoestq · October 16, 2023, 1:58pm

Hi!

Dataset.from_list expects a list of dictionaries, e.g.

data = [{"text": "foo", "label": 0}, {"text": "bar", "label": 1}]
ds = Dataset.from_list(data)

What does your sp_train contain?

mu1990 · October 16, 2023, 3:31pm

Hi,
it contains 507 dictionaries:

>>> sp_train
{'id': 'SP-201', 'question': 'Imagine you are in a room, with no doors, windows, or anything. How do you get out?', 'answer': 'Stop imagining.', 'distractor1': 'Break the Wall.', 'distractor2': 'Jump out of the roof.', 'distractor(unsure)': 'None of above.', 'label': 0, 'choice_list': ['Stop imagining.', 'Jump out of the roof.', 'Break the Wall.', 'None of above.'], 'choice_order': [0, 2, 1, 3]}
>>> sp_train.shape
(507,)
>>> sp_train.dtype
object

It is a one-dimensional ndarray.

lhoestq · October 16, 2023, 8:59pm

Could it be that one of the dictionaries contains data of a different type than the others?

Also, if you can share the full stack trace, it would help to understand what's happening.
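
One way to check this without involving Arrow is to collect the Python types seen under each key. This is only a minimal sketch, assuming the records are plain dicts as in the original post; the helper name find_mixed_columns is made up for illustration and is not part of the datasets library.

```python
from collections import defaultdict


def find_mixed_columns(records):
    """Return {key: sorted type names} for keys whose values have more than one type."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(names) for key, names in seen.items() if len(names) > 1}


# Toy records modelled on the ones in this thread.
records = [
    {"id": "SP-87_CR", "distractor2": "Dissolve the mountain in the water.", "label": 3},
    {"id": "SP-88", "distractor2": 2000.0, "label": 0},
]
mixed = find_mixed_columns(records)
print(mixed)
```

Any key reported here is a candidate for the type error. Note that None counts as its own type in this sketch, so a column with missing values will also be listed even if it is otherwise consistent.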

mu1990 · October 17, 2023, 1:02am

Can I use something like

ndarray.astype(dtype, order='K', casting='unsafe', subok=True, copy=True)

to solve this? Or does that not make any sense?

Thank you so much for all your help

mu1990 · October 17, 2023, 2:15am

This is the whole stack trace:

ArrowTypeError Traceback (most recent call last)

in <cell line: 21>()
19 sp_train_list
20
---> 21 sp_train_dataset = Dataset.from_list(sp_train_list)
22
23 # print(sp_train_dataset)

12 frames

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in from_list(cls, mapping, features, info, split)
949 # for simplicity and consistency wrt OptimizedTypedSequence we do not use InMemoryTable.from_pylist here
950 mapping = {k: [r.get(k) for r in mapping] for k in mapping[0]} if mapping else {}
--> 951 return cls.from_dict(mapping, features, info, split)
952
953 @staticmethod

/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py in from_dict(cls, mapping, features, info, split)
909 arrow_typed_mapping[col] = data
910 mapping = arrow_typed_mapping
--> 911 pa_table = InMemoryTable.from_pydict(mapping=mapping)
912 if info is None:
913 info = DatasetInfo()

/usr/local/lib/python3.10/dist-packages/datasets/table.py in from_pydict(cls, *args, **kwargs)
797 datasets.table.Table
798 """
--> 799 return cls(pa.Table.from_pydict(*args, **kwargs))
800
801 @classmethod

/usr/local/lib/python3.10/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/usr/local/lib/python3.10/dist-packages/pyarrow/table.pxi in pyarrow.lib._from_pydict()

/usr/local/lib/python3.10/dist-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/usr/local/lib/python3.10/dist-packages/pyarrow/array.pxi in pyarrow.lib.array()

/usr/local/lib/python3.10/dist-packages/pyarrow/array.pxi in pyarrow.lib._handle_arrow_array_protocol()

/usr/local/lib/python3.10/dist-packages/datasets/arrow_writer.py in arrow_array(self, type)
187 else:
188 trying_cast_to_python_objects = True
--> 189 out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
190 # use smaller integer precisions if possible
191 if self.trying_int_optimization:

/usr/local/lib/python3.10/dist-packages/pyarrow/array.pxi in pyarrow.lib.array()

/usr/local/lib/python3.10/dist-packages/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

/usr/local/lib/python3.10/dist-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/usr/local/lib/python3.10/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: Expected bytes, got a 'float' object

lhoestq · October 17, 2023, 8:40am

Hmm, I'm not sure what's happening exactly. You can try converting your NumPy array to a list before passing it to from_list. If it shows the same error, it's likely an issue with the data.

mu1990 · October 17, 2023, 5:29pm

I passed instances 0 to 218 like this:

sp_train_dataset = Dataset.from_list(sp_train_list[:219])

It seems that the following dictionaries (219, 220 and 221) are the issue: some of their keys hold float values:

This one is OK 218 {'id': 'SP-87_CR', 'question': 'How to fit an entire mountain inside a suitcase?', 'answer': 'None of above.', 'distractor1': 'Cut the mountain in small pieces.', 'distractor2': 'Dissolve the mountain in the water.', 'distractor(unsure)': 'Heat the mountain in a high temperature.', 'label': 3, 'choice_list': ['Dissolve the mountain in the water.', 'Heat the mountain in a high temperature.', 'Cut the mountain in small pieces.', 'None of above.'], 'choice_order': [2, 3, 1, 0]}

219 {'id': 'SP-88', 'question': "Name the most recent year in which New Year's came before Christmas.\n", 'answer': 'This year. \n', 'distractor1': 'Last year.', 'distractor2': 2000.0, 'distractor(unsure)': 'None of above.', 'label': 0, 'choice_list': ['This year. \n', 'Last year.', '2000.0', 'None of above.'], 'choice_order': [0, 1, 2, 3]}

220 {'id': 'SP-88_SR', 'question': "What was the most recent year when New Year's arrive before Christmas?", 'answer': 'This year. \n', 'distractor1': 'Last year.', 'distractor2': 2000.0, 'distractor(unsure)': 'None of above.', 'label': 2, 'choice_list': ['2000.0', 'Last year.', 'This year. \n', 'None of above.'], 'choice_order': [2, 1, 0, 3]}

221 {'id': 'SP-88_CR', 'question': "Which year in the history that New Year's arrive before Christmas?", 'answer': 'Every year.\n', 'distractor1': 2020.0, 'distractor2': 2000.0, 'distractor(unsure)': 1776.0, 'label': 1, 'choice_list': ['1776.0', 'Every year.\n', '2020.0', 'None of above.'], 'choice_order': [3, 0, 1, 2]}

Why is it not allowed to have different data types in the same column? Is there a parameter to convert these columns to strings?

mu1990 · October 17, 2023, 8:07pm

Here is what I did to solve it:

for dictionary in sp_train_list:
  dictionary['distractor1'] = str(dictionary['distractor1'])
  dictionary['distractor2'] = str(dictionary['distractor2'])
  dictionary['distractor(unsure)'] = str(dictionary['distractor(unsure)'])

Thank you once again!
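
The same fix can be written as a small helper that leaves the original list untouched. This is a sketch only: stringify_columns is a hypothetical name, and the toy records below stand in for sp_train_list. The returned list would then be passed to Dataset.from_list as before.

```python
def stringify_columns(records, columns):
    """Return new dicts in which the given columns are converted with str()."""
    return [
        {key: str(value) if key in columns else value for key, value in record.items()}
        for record in records
    ]


# Toy records modelled on entries 219 and 221 above.
records = [
    {"id": "SP-88", "distractor1": "Last year.", "distractor2": 2000.0, "label": 0},
    {"id": "SP-88_CR", "distractor1": 2020.0, "distractor2": 2000.0, "label": 1},
]
cleaned = stringify_columns(records, {"distractor1", "distractor2"})
print(cleaned[1])
```

One caveat: str() also turns None into the string 'None', so columns with missing values may need separate handling.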

adinhobl-asapp · April 30, 2024, 3:43pm

I ran into this same error message, and it looks like it is a similar situation to mu1990’s original question. Here is a reproducible example, with lists of dictionaries.

Dataset.from_list([{"hi": {"bye": [0.2934890238]}}])  # works
Dataset.from_list([{"hi": {"bye": ["hi", "bye"]}}])  # works
Dataset.from_list([{"hi": {"bye": ["hi", 0.2934890238]}}])  # errors

The issue seems to appear when the dicts are nested and one of the nested values is a list with mixed types. In the failing example, the first item in the inner list is a string, but the second one is a float. Floats by themselves work, as do lists of strings. I'm guessing whatever tool parses the list expects all of the inner list items to be the same type.

I have a dataset where the “bye” key is something like “tags”, and the tags are a list of values different tools have added. I could probably remake it as a dict instead.
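
If remaking the data is not an option, one possible workaround is to convert the items of any mixed-type list to strings before building the dataset. The following is a rough sketch under that assumption; stringify_mixed_lists is a hypothetical helper, not something the datasets library provides.

```python
def stringify_mixed_lists(value):
    """Recursively convert the items of any list holding more than one type to str."""
    if isinstance(value, dict):
        return {key: stringify_mixed_lists(item) for key, item in value.items()}
    if isinstance(value, list):
        items = [stringify_mixed_lists(item) for item in value]
        if len({type(item) for item in items}) > 1:
            return [str(item) for item in items]
        return items
    return value


row = {"hi": {"bye": ["hi", 0.2934890238]}}
fixed = stringify_mixed_lists(row)
print(fixed)
```

This only makes each individual list consistent. If one row holds a list of floats and another row holds a list of strings under the same key, the column as a whole is still mixed and would need the same treatment across rows.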

lhoestq · May 13, 2024, 4:28pm

FYI, we use Arrow to store the data, and since it is a columnar format, it expects each array to have a single fixed type (like NumPy).
