ArrowInvalidError

When reading a line-delimited JSON file using pyarrow’s read_json(), I get the following error:

Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "/Users/home/.local/share/virtualenvs/env-sIFPHfLo/lib/python3.8/site-packages/datasets/builder.py", line 1819, in _prepare_split_single
    for _, table in generator:
  File "/Users/home/.cache/huggingface/modules/datasets_modules/datasets/names/b87f6e0e1adeb39a62ff9629a9967f8226247470eeec48bf080721a58ee7377d/names.py", line 121, in _generate_tables
    table = read_json(filepath)
  File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0

However, I can read the original file just fine:

>>> from pyarrow.json import read_json
>>> d = read_json("/Users/home/Downloads/names.jsonlist.gz")
>>> d.schema
url: string
title: string
gender: string
type: string
...

The dataset builder is unable to read the downloaded file, though. Any clue why this is happening? I am using pyarrow’s read_json() because it is almost 3x faster than pandas’ read_json().

As a workaround, I decided to read the file using pandas instead. However, I ran into a different error:

Downloading and preparing dataset aliases/default to /Users/home/.cache/huggingface/datasets/aliases/default/1.0.0/5d933aa1538259f753a65ea696f0e78a15480e53c1852b167f94d41433e6a1d7...
Generating train split:   0%|          | 0/98149 [00:04<?, ? examples/s]
Traceback (most recent call last):
  File "/Users/home/.local/share/virtualenvs/env-sIFPHfLo/lib/python3.8/site-packages/datasets/builder.py", line 1833, in _prepare_split_single
    writer.write_table(table)
  File "/Users/home/.local/share/virtualenvs/env-sIFPHfLo/lib/python3.8/site-packages/datasets/arrow_writer.py", line 567, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/Users/home/.local/share/virtualenvs/env-sIFPHfLo/lib/python3.8/site-packages/datasets/table.py", line 2312, in table_cast
    return cast_table_to_schema(table, schema)
  File "/Users/home/.local/share/virtualenvs/env-sIFPHfLo/lib/python3.8/site-packages/datasets/table.py", line 2272, in cast_table_to_schema
    return pa.Table.from_arrays(arrays, schema=schema)
  File "pyarrow/table.pxi", line 3657, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1421, in pyarrow.lib._sanitize_arrays
  File "pyarrow/array.pxi", line 347, in pyarrow.lib.asarray
  File "pyarrow/table.pxi", line 523, in pyarrow.lib.ChunkedArray.cast
  File "/Users/home/.local/share/virtualenvs/env-sIFPHfLo/lib/python3.8/site-packages/pyarrow/compute.py", line 391, in cast
    return call_function("cast", [arr], options)
  File "pyarrow/_compute.pyx", line 560, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 355, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from list<item: struct<url: string, score: double>> to struct using function cast_struct

After searching online for a solution, I came across ARROW-1888, "[C++] Implement casts from one struct type to another (with same field names and number of fields)", on the ASF JIRA, which has since been resolved. I added a print statement before the failing line to check the schema being generated, and this is what it produces:

names: struct<url: list<item: string>, score: list<item: double>>
  child 0, url: list<item: string>
      child 0, item: string
  child 1, score: list<item: double>
      child 0, item: double

Shouldn’t it have been:

names: list<item: struct<url: string, score: double>>
  child 0, item: struct<url: string, score: double>
      child 0, url: string
      child 1, score: double

This is how I define the schema inside _info():

"names": datasets.Sequence(
    datasets.Features(
        {
            "url": datasets.Value("string"),
            "score": datasets.Value("float64"),
        }
    )
)

I also tried this:

"names": datasets.Sequence(
    {
        "url": datasets.Value("string"),
        "score": datasets.Value("float64"),
    }
)

The field is a list of dictionaries. Each dictionary has two keys, whose values are a string and a floating-point number. So, is my definition of a list of dictionaries correct? If so, why is the returned schema incorrect?

Update
If it is of any help, I checked the feature being generated inside the arrow_writer.py module. This is what it produces:
'names': Sequence(feature={'url': Value(dtype='string', id=None), 'score': Value(dtype='float64', id=None)}, length=-1, id=None)}
However, the corresponding type is:
names: struct<url: list<item: string>, score: list<item: double>>

The arrow documentation states that read_json() automatically decompresses the file based on its extension, and the download module strips that extension from the downloaded file. That explains why reading the downloaded file failed.
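One way around the missing extension, sketched under assumptions: check the file's first two bytes for the gzip magic number and tell pyarrow the compression explicitly rather than letting it guess from the name. The helper names is_gzip and read_json_any are hypothetical and not part of datasets or pyarrow. pyarrow.input_stream and pyarrow.json.read_json are real calls, but treat the combination as an untested sketch for your setup.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def read_json_any(path):
    """Read line-delimited JSON whether or not the file name ends in .gz.

    Hypothetical helper: it passes the compression explicitly instead of
    relying on pyarrow's extension-based detection.
    """
    # Imported lazily so is_gzip() stays usable without pyarrow installed.
    import pyarrow as pa
    from pyarrow.json import read_json

    compression = "gzip" if is_gzip(path) else None
    with pa.input_stream(str(path), compression=compression) as stream:
        return read_json(stream)
```

Inside _generate_tables, the call would then be table = read_json_any(filepath) in place of read_json(filepath).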

My question is: is it possible to speed up the reading of the json file?

To define a list of dictionaries you must use

"names" : [{
    "url": datasets.Value("string"),
    "score": datasets.Value("float64"),
}]

since Sequence inverts the order of the dict and the list. See "Main classes" in the documentation:

A Sequence with an internal dictionary feature will be automatically converted into a dictionary of lists. This behavior is implemented to have a compatibility layer with the TensorFlow Datasets library but may be unwanted in some cases. If you don’t want this behavior, you can use a python list instead of the Sequence.
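The inversion can be pictured in plain Python, with no datasets API involved. The helper name and the sample values below are made up for illustration. It only shows the two layouts: the list of dicts found in the JSON file, and the dict of lists that a Sequence over a dict expects.

```python
def list_of_dicts_to_dict_of_lists(records):
    """Turn [{"url": ..., "score": ...}, ...] into {"url": [...], "score": [...]}."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


# One row's "names" field as it would appear in the JSON file (made-up values).
names = [
    {"url": "https://example.com/a", "score": 0.9},
    {"url": "https://example.com/b", "score": 0.4},
]

inverted = list_of_dicts_to_dict_of_lists(names)
print(inverted)
# {'url': ['https://example.com/a', 'https://example.com/b'], 'score': [0.9, 0.4]}
```

The first layout corresponds to list<item: struct<url: string, score: double>>, and the second to struct<url: list<item: string>, score: list<item: double>>, which is the mismatch reported in the cast error.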

@lhoestq
Thank you! That is what I did eventually.

Somehow I missed or misread the definition in the documentation.