I was looking for a workaround, so I decided to read the file using pandas. However, I ran into an error:
Downloading and preparing dataset aliases/default to /Users/home/.cache/huggingface/datasets/aliases/default/1.0.0/5d933aa1538259f753a65ea696f0e78a15480e53c1852b167f94d41433e6a1d7...
Generating train split: 0%| | 0/98149 [00:04<?, ? examples/s]Traceback (most recent call last):
File "/Users/home/.local/share/virtualenvs/env-sIFPHfLo/lib/python3.8/site-packages/datasets/builder.py", line 1833, in _prepare_split_single
writer.write_table(table)
File "/Users/home/.local/share/virtualenvs/env-sIFPHfLo/lib/python3.8/site-packages/datasets/arrow_writer.py", line 567, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/Users/home/.local/share/virtualenvs/env-sIFPHfLo/lib/python3.8/site-packages/datasets/table.py", line 2312, in table_cast
return cast_table_to_schema(table, schema)
File "/Users/home/.local/share/virtualenvs/env-sIFPHfLo/lib/python3.8/site-packages/datasets/table.py", line 2272, in cast_table_to_schema
return pa.Table.from_arrays(arrays, schema=schema)
File "pyarrow/table.pxi", line 3657, in pyarrow.lib.Table.from_arrays
File "pyarrow/table.pxi", line 1421, in pyarrow.lib._sanitize_arrays
File "pyarrow/array.pxi", line 347, in pyarrow.lib.asarray
File "pyarrow/table.pxi", line 523, in pyarrow.lib.ChunkedArray.cast
File "/Users/home/.local/share/virtualenvs/env-sIFPHfLo/lib/python3.8/site-packages/pyarrow/compute.py", line 391, in cast
return call_function("cast", [arr], options)
File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from list<item: struct<url: string, score: double>> to struct using function cast_struct
After searching online for a solution, I came across the ASF JIRA issue [ARROW-1888] [C++] Implement casts from one struct type to another (with same field names and number of fields), which has since been resolved. I then added a print statement just before the failing call to check the schema being generated, and this is what it produces:
names: struct<url: list<item: string>, score: list<item: double>>
  child 0, url: list<item: string>
      child 0, item: string
  child 1, score: list<item: double>
      child 0, item: double
Shouldn’t it have been:
names: list<item: struct<url: string, score: double>>
  child 0, item: struct<url: string, score: double>
      child 0, url: string
      child 1, score: double
This is how I define the schema inside _info():
"names": datasets.Sequence(
    datasets.Features(
        {
            "url": datasets.Value("string"),
            "score": datasets.Value("float64"),
        }
    )
)
I also tried this:
"names": datasets.Sequence(
    {
        "url": datasets.Value("string"),
        "score": datasets.Value("float64"),
    }
)
The field in question is a list of dictionaries. Each dictionary has two keys, "url" and "score", whose values are a string and a floating-point number, respectively. So, is my definition of a list of dictionaries correct? If so, why is the returned schema incorrect?
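To illustrate what I mean, here is a small sketch in plain Python. The URLs and scores are made up for illustration and are not from my data, and the helper name is hypothetical. It shows one record's field as a list of dictionaries, and the column-wise shape that a struct-of-lists type would correspond to.

```python
# One example's "names" value as a list of dictionaries (row-wise).
# The URLs and scores below are invented for illustration.
names_as_list_of_dicts = [
    {"url": "https://example.com/a", "score": 0.9},
    {"url": "https://example.com/b", "score": 0.4},
]


def to_dict_of_lists(rows):
    """Transpose a list of dicts into a dict of lists (column-wise)."""
    keys = rows[0].keys() if rows else []
    return {key: [row[key] for row in rows] for key in keys}


# The same data arranged as one dictionary holding two parallel lists.
names_as_dict_of_lists = to_dict_of_lists(names_as_list_of_dicts)
print(names_as_dict_of_lists)
```

Both shapes hold the same information, but they map to different Arrow types, which is why the distinction matters here.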
Update
If it is of any help, I checked the feature being generated inside the arrow_writer.py module. This is what it produces:
'names': Sequence(feature={'url': Value(dtype='string', id=None), 'score': Value(dtype='float64', id=None)}, length=-1, id=None)}
However, the corresponding type is:
names: struct<url: list<item: string>, score: list<item: double>>