Strange Error While Attempting to Load DataSet

Hi all, I’m fairly new to the HF interface. I was trying to load a 16 MB dataset containing Arabic characters and got the following error. I’m honestly confused about what the error means.

0/site-packages/datasets/", line 1833, in wrapper
return func(array, *args, **kwargs)
File “/home/user/.local/lib/python3.10/site-packages/datasets/”, line 2027, in array_cast
return array.cast(pa_type)
File “pyarrow/array.pxi”, line 980, in pyarrow.lib.Array.cast
File “/home/user/.local/lib/python3.10/site-packages/pyarrow/”, line 403, in cast
return call_function(“cast”, [arr], options, memory_pool)
File “pyarrow/_compute.pyx”, line 572, in pyarrow._compute.call_function
File “pyarrow/_compute.pyx”, line 367, in
File “pyarrow/error.pxi”, line 144, in pyarrow.lib.pyarrow_internal_check_status
File “pyarrow/error.pxi”, line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Failed to parse string: ‘17 - “”’ as a scalar of type int64

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File “/home/user/app/”, line 9, in
dataset = load_dataset('FDSRashid/hadith_info', data_files='Basic_Edge_Information.csv', token=Secret_token, split='train')
File “/home/user/.local/lib/python3.10/site-packages/datasets/”, line 2153, in load_dataset
File “/home/user/.local/lib/python3.10/site-packages/datasets/”, line 954, in download_and_prepare
File “/home/user/.local/lib/python3.10/site-packages/datasets/”, line 1049, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File “/home/user/.local/lib/python3.10/site-packages/datasets/”, line 1813, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File “/home/user/.local/lib/python3.10/site-packages/datasets/”, line 1958, in _prepare_split_single
raise DatasetGenerationError(“An error occurred while generating the dataset”) from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

It looks like your dataset contains values of inconsistent types. One column was inferred as type “int64”, but it contains the value 17 - “”, which can’t be converted to an integer.

Could you share some data samples and the code you used to load the dataset? That would help us investigate why you end up with this error.
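In the meantime, one way to locate the offending values is to scan the suspect column yourself. Here is a minimal sketch using only the Python standard library. The sample data and the column name `count` are made up for illustration, so you would read your own file and column instead of the in-memory string. It also flags values such as 1_000, since those are not plain digit strings either.

```python
import csv
import io
import re

# Hypothetical sample standing in for the real CSV; the column names are made up.
SAMPLE_CSV = 'id,count\n1,12\n2,17 - ""\n3,\n4,1_000\n'

# A "plain" integer: optional sign followed by ASCII digits only.
INT_PATTERN = re.compile(r"[+-]?[0-9]+")


def find_non_integer_cells(csv_text, column):
    """Return (line_number, value) pairs for non-empty cells that are not plain integers."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        value = (row[column] or "").strip()
        # Empty cells are treated as nulls and skipped, not reported.
        if value and not INT_PATTERN.fullmatch(value):
            bad.append((line_number, value))
    return bad


bad_cells = find_non_integer_cells(SAMPLE_CSV, "count")
print(bad_cells)
```

Any value it reports is a candidate for the cell that the int64 cast rejected.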

That was precisely the error! I simply loaded the dataset using load_dataset('path/to/dataset'), without any modification to the dataset. There were some rows with invalid values and some null values in the dataset, and pyarrow inferred the column type as integer. I made a temporary fix by defining a column schema that sets the data type of every column to string. However, this leads me to my second issue: loading datasets with null values. Even when I set every column type to string, null values aren’t read in and load_dataset yields an error. Now I’m confused about how to read in datasets with null values using the load_dataset() function.

What’s the error message? load_dataset should work even if you have null values.

This is my column schema: features = Features({'Book_ID': Value('int32'), 'taraf_ID': Value('string'), 'Hadith_ID': Value('string'), 'matn': Value('string'), 'taraf_tally': Value('int32'), 'wordcount': Value('string'), 'Domain': Value('string'), 'Category': Value('string'), 'translation': Value('string')}). When I try to load the dataset using this code: dataset = load_dataset("FDSRashid/hadith_info", data_files='All_Matns.csv', token=string1, features=features), I get the following error:

Failed to read file '/root/.cache/huggingface/datasets/downloads/ac7e243c60b61b8decc6fc884b4b76a7d6c12164953ec0f10a672362460a1bcd' with error <class 'ValueError'>: cannot safely convert passed user dtype of int32 for object dtyped data in column 4
ERROR:datasets.packaged_modules.csv.csv:Failed to read file '/root/.cache/huggingface/datasets/downloads/ac7e243c60b61b8decc6fc884b4b76a7d6c12164953ec0f10a672362460a1bcd' with error <class 'ValueError'>: cannot safely convert passed user dtype of int32 for object dtyped data in column 4
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

TypeError: Cannot cast array data from dtype('O') to dtype('int32') according to the rule 'safe'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
15 frames
ValueError: cannot safely convert passed user dtype of int32 for object dtyped data in column 4

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/ in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1956             if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1957                 e = e.__context__
-> 1958             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1960         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

I did successfully load everything when every column was a string; apologies for the confusion. But if I have numeric data with some empty values, is passing them as strings the only way to load them?

Integers in CSV can be loaded as integers in general.

However, in your case the CSV contains integers formatted like “1_000” instead of “1000”, for example, and pandas doesn’t support that format.
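If you want to keep the column numeric, one option is to normalize those values before loading. The sketch below uses only the standard library and a made-up two-column sample. It assumes the problem column is taraf_tally, which is the fifth entry in your schema and would match "column 4" in the error if that index is zero-based. The helper names are hypothetical.

```python
import csv
import io

# Hypothetical sample; the real file has more columns and rows.
SAMPLE_CSV = "Book_ID,taraf_tally\n1,1_000\n2,\n3,42\n"


def parse_optional_int(text):
    """Convert '1_000' to 1000 and '' to None; raise ValueError on anything else."""
    cleaned = text.strip().replace("_", "")
    if cleaned == "":
        return None
    return int(cleaned)


def clean_column(csv_text, column):
    """Return a copy of the CSV text with underscores removed from one integer column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        value = parse_optional_int(row[column])
        # Keep empty cells empty so they can still be read as nulls.
        row[column] = "" if value is None else str(value)
        writer.writerow(row)
    return out.getvalue()


cleaned_csv = clean_column(SAMPLE_CSV, "taraf_tally")
print(cleaned_csv)
```

This only rewrites the text of the file. Whether the remaining blank cells then load under an integer feature is worth checking against your installed version of datasets.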