Strange Error While Attempting to Load DataSet

FDSRashid · September 11, 2023, 8:01pm

Hi all, I’m kind of a beginner with the HF interface. I was trying to load a 16 MB dataset containing Arabic characters and got the following error. I’m honestly confused about what the error means.

0/site-packages/datasets/table.py", line 1833, in wrapper
    return func(array, *args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/datasets/table.py", line 2027, in array_cast
    return array.cast(pa_type)
  File "pyarrow/array.pxi", line 980, in pyarrow.lib.Array.cast
  File "/home/user/.local/lib/python3.10/site-packages/pyarrow/compute.py", line 403, in cast
    return call_function("cast", [arr], options, memory_pool)
  File "pyarrow/_compute.pyx", line 572, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 367, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Failed to parse string: '17 - ""' as a scalar of type int64

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/app/app.py", line 9, in <module>
    dataset = load_dataset('FDSRashid/hadith_info',data_files = 'Basic_Edge_Information.csv', token = Secret_token, split = 'train')
  File "/home/user/.local/lib/python3.10/site-packages/datasets/load.py", line 2153, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/user/.local/lib/python3.10/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/home/user/.local/lib/python3.10/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/datasets/builder.py", line 1813, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/user/.local/lib/python3.10/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

lhoestq · September 21, 2023, 10:36am

It looks like your dataset has data of inconsistent types. One column is being loaded as type “int64”, but it contains the value 17 - “”, which can’t be converted to an integer.

Could you share some data samples and the code you used to load the dataset? That would help to investigate why you end up with this error.
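One way to track down values like this before calling load_dataset is to read the CSV with every column as text and list the cells that are not plain integers. The following is a minimal sketch using pandas, not part of the datasets API; the sample data, the column names `source` and `target`, and the helper `find_non_integer_rows` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; the columns and values are invented.
SAMPLE_CSV = "source,target\n1,2\n3,17 - a\n4,\n5,6\n"


def find_non_integer_rows(source, column):
    """Return the rows whose `column` value is neither empty nor a plain integer."""
    # dtype=str disables type inference; keep_default_na=False keeps empty cells as "".
    df = pd.read_csv(source, dtype=str, keep_default_na=False)
    values = df[column]
    is_integer = values.str.fullmatch(r"[+-]?\d+")
    is_empty = values == ""
    return df[~(is_integer | is_empty)]


bad_rows = find_non_integer_rows(io.StringIO(SAMPLE_CSV), "target")
print(bad_rows)
```

With a real file, pass its path instead of the `io.StringIO` object. Any row that shows up is a candidate for the value that cannot be cast to int64.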

FDSRashid · September 24, 2023, 3:30am

This was precisely the error! I simply loaded the dataset using load_dataset('path/to/dataset'), without any modification to the dataset. Some rows had invalid values and some had null values, and pyarrow inferred integer as the column type. I made a temporary fix by defining a column schema and setting the data type of all the columns to string. However, this leads me to my second issue: loading datasets with null values. Even when I set every column type to string, null values aren’t read in and load_dataset yields an error. Now I’m confused about how to read in datasets with null values using the load_dataset() function.

lhoestq · September 24, 2023, 3:39pm

What’s the error message? load_dataset should work even if you have null values.

FDSRashid · September 24, 2023, 5:40pm

This is my column schema: features = Features({'Book_ID': Value('int32'),'taraf_ID': Value('string'), 'Hadith_ID': Value('string'), 'matn': Value('string'), 'taraf_tally': Value('int32'), 'wordcount': Value('string'), 'Domain': Value('string'), 'Category': Value('string'), 'translation': Value('string')}). When I try to load this dataset using this code: dataset = load_dataset("FDSRashid/hadith_info", data_files = 'All_Matns.csv', token = string1, features = features), I get the following error:

Failed to read file '/root/.cache/huggingface/datasets/downloads/ac7e243c60b61b8decc6fc884b4b76a7d6c12164953ec0f10a672362460a1bcd' with error <class 'ValueError'>: cannot safely convert passed user dtype of int32 for object dtyped data in column 4
ERROR:datasets.packaged_modules.csv.csv:Failed to read file '/root/.cache/huggingface/datasets/downloads/ac7e243c60b61b8decc6fc884b4b76a7d6c12164953ec0f10a672362460a1bcd' with error <class 'ValueError'>: cannot safely convert passed user dtype of int32 for object dtyped data in column 4
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

TypeError: Cannot cast array data from dtype('O') to dtype('int32') according to the rule 'safe'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
15 frames
ValueError: cannot safely convert passed user dtype of int32 for object dtyped data in column 4

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1956             if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1957                 e = e.__context__
-> 1958             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1959 
   1960         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

I did successfully load everything when every column was a string, apologies for the confusion. But if I have numeric data with some empty values, is passing them as strings the only way to load them?

lhoestq · September 25, 2023, 9:33am

Integers in CSV can be loaded as integers in general.
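As a small illustration in plain pandas (which the error log above shows is doing the CSV parsing), an integer column with an empty cell still loads; it just cannot use a plain NumPy integer dtype. This sketch uses invented sample values with column names borrowed from the schema above. It only demonstrates pandas behaviour and is not a statement about how load_dataset handles the column internally.

```python
import io

import pandas as pd

# Invented sample: the second row has an empty taraf_tally cell.
SAMPLE_CSV = "Book_ID,taraf_tally\n1,10\n2,\n3,30\n"

# Default inference: the empty cell becomes NaN, so the column is read as float64.
inferred = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Nullable integer dtype: values stay integers and the empty cell becomes <NA>.
nullable = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype={"taraf_tally": "Int64"})

print(inferred.dtypes)
print(nullable.dtypes)
```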

However, in your case the CSV contains integers formatted like “1_000” instead of “1000”, for example, and pandas doesn’t support that format.
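If editing the CSV is not an option, one possible workaround is to load the affected column as text, strip the underscores, and convert afterwards. Below is a minimal pandas sketch under that assumption; the sample rows and the helper name `load_with_clean_integers` are hypothetical, and only the column names come from the schema above.

```python
import io

import pandas as pd

# Invented sample rows; one value uses an underscore separator, one cell is empty.
SAMPLE_CSV = "Hadith_ID,taraf_tally\na,12\nb,1_000\nc,\n"


def load_with_clean_integers(source, int_columns):
    """Read a CSV as text, then convert the given columns to nullable integers."""
    # Read everything as strings so pandas does not try to infer integer columns.
    df = pd.read_csv(source, dtype=str, keep_default_na=False)
    for col in int_columns:
        # Remove underscore separators such as "1_000" -> "1000".
        cleaned = df[col].str.replace("_", "", regex=False)
        # Unparseable or empty cells become NaN, then <NA> in the Int64 column.
        df[col] = pd.to_numeric(cleaned, errors="coerce").astype("Int64")
    return df


df = load_with_clean_integers(io.StringIO(SAMPLE_CSV), ["taraf_tally"])
print(df)
```

The resulting DataFrame could then be handed to `datasets.Dataset.from_pandas`, although whether that fits depends on how the rest of the pipeline loads the data.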

numpy01 · March 28, 2025, 1:39pm

Good day.
I don’t know if I can ask my question in this discussion, because I can’t find where to post my own topic.
This is the error I am facing:

ValueError                                Traceback (most recent call last)
Cell In[127], line 95
     89 callbacks = [
     90     EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
     91     ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1)
     92 ]
     94 # Train model
---> 95 history = model.fit(
     96     train_generator,
     97     validation_data=val_generator,
     98     epochs=20,
     99     callbacks=callbacks
    100 )
    102 # Plot training curves
    103 plt.figure(figsize=(10, 5))

File ~\anaconda3\Lib\site-packages\keras\src\utils\traceback_utils.py:122, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    119     filtered_tb = _process_traceback_frames(e.__traceback__)
    120     # To get the full stack trace, call:
    121     # keras.config.disable_traceback_filtering()
--> 122     raise e.with_traceback(filtered_tb) from None
    123 finally:
    124     del filtered_tb

File ~\anaconda3\Lib\site-packages\keras\src\trainers\data_adapters\py_dataset_adapter.py:295, in PyDatasetAdapter.get_tf_dataset(self)
    290 batches = [
    291     self._standardize_batch(self.py_dataset[i])
    292     for i in range(num_samples)
    293 ]
    294 if len(batches) == 0:
--> 295     raise ValueError("The PyDataset has length 0")
    296 self._output_signature = data_adapter_utils.get_tensor_spec(batches)
    298 ds = tf.data.Dataset.from_generator(
    299     self._get_iterator,
    300     output_signature=self._output_signature,
    301 )

ValueError: The PyDataset has length 0

John6666 · March 28, 2025, 2:55pm

TensorFlow error?

github.com/tensorflow/tensorflow

Keras fit method not working with dataset iterator

opened 02:27PM - 14 Jun 18 UTC

closed 06:08PM - 21 Jun 18 UTC

sibyjackgrove

### System information

- **Have I written custom code (as opposed to using a stock example script provided in TensorFlow)**: Custom
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Windows 10
- **TensorFlow installed from (source or binary)**: Binary
- **TensorFlow version (use command below)**: 1.9-rc0
- **Python version**: 3.6
- **Bazel version (if compiling from source)**: NA
- **GCC/Compiler version (if compiling from source)**: NA
- **CUDA/cuDNN version**: NA
- **GPU model and memory**: NA
- **Exact command to reproduce**: `model.fit(get_iterator,steps_per_epoch=2,batch_size=2,epochs=2,shuffle =True,verbose=1)` and `model.fit(get_iterator,get_iterator,steps_per_epoch=2,batch_size=2,epochs=2,shuffle =True,verbose=1)`

### Describe the problem

When I pass one dataset iterator to `fit` method, I get:

> Please provide data as a list or tuple of 2 elements - input and target pair. Received Tensor("IteratorGetNext_4:0", shape=(2, ?), dtype=float32)

When I pass two iterators I get the error:

> ValueError: You passed a dataset or dataset iterator (<tensorflow.python.data.ops.iterator_ops.Iterator object at 0x000001FEABE88748>) as input `x` to your model. In that case, you should not specify a target (`y`) argument, since the dataset or dataset iterator generates both input data and target data. Received: <tensorflow.python.data.ops.iterator_ops.Iterator object at 0x000001FEABE88748>

When I create a new dataset after zipping the original x and y data set and pass that to `fit` I get the error described in https://github.com/tensorflow/tensorflow/issues/19912

According to 1.9-rc0 method release notes iterators should be usable with keras training methods. Please provide a solution or provide clarification in the documentation.

### Source code / logs

```
dataset= tf.contrib.data.make_csv_dataset(file_name,48,select_columns= ['Load_residential_multi_0','Load_residential_multi_1'],shuffle=False)
dataset = dataset.map(lambda x: tf.stack(list(x.values())))
get_iterator = dataset.make_one_shot_iterator()
get_batch = get_iterator.get_next()

#Building and training a single layer model using Keras (Available within TensorFlow)
model = Sequential()
#Input Layer
model.add(InputLayer(input_shape=(48,),name='InputLayer'))#,input_tensor =dataset
#model.add(BatchNormalization(axis=1)) #Normalizing values
#Layer1
model.add(Dense(units=5,activation='relu',name='FeedForward1')) #Add a feed forward layer
#Layer2
model.add(Dense(units=5,activation='relu',name='FeedForward2')) #Add a feed forward layer
#Output layer
model.add(Dense(units=48,name='OutputLayer'))
#Specify los function and optimizer
model.compile(loss='mse',optimizer='adam',metrics=['mae'])
#Summarize model
model.summary()
#Train the model
model.fit(get_iterator,steps_per_epoch=2,batch_size=2,epochs=2,shuffle =True,verbose=1)
#model.fit(get_iterator,get_iterator,steps_per_epoch=2,batch_size=2,epochs=2,shuffle =True,verbose=1)
```

github.com/keras-team/keras

PyDataset Documentation and Best Practices

opened 01:09PM - 21 Aug 24 UTC

dryglicki

type:support

**Keras Version:** 3.5.0
**Tensorflow Version:** 2.17.0

**What I want to do:** Use PyDataset class in a data distributed environment.

---

I would like to ask about the status of PyDataset and some of its best uses and practices. I have a functioning PyDataset class that ingests and processes HDF files:

```
class HDFDataset(K.utils.PyDataset):
    '''
    Keras data loader to replace Tensorflow's Dataset API. Reads HDF5 files.

    Inputs:
        file_list: list
            list of file names, pre-globbed
        batch_size: int
            size of batches
        shuffle: bool
            whether or not to shuffle the dataset at the end of each epoch
        lons_lats: bool
            whether or not to include longitudes and latitudes
        -- Additional keyword arguments --
        workers=1
        use_multiprocessing=False
        max_queue_size=10
    '''
    def __init__(self, file_list: list | tuple | set, batch_size: int, shuffle: bool = False, lons_lats: bool = False, subsample: bool = False, **kwargs):
        super(HDFDataset, self).__init__(**kwargs)
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.tmplen = len(self.file_list)
        self.subsample = subsample
        if self.subsample:
            self.slice = slice(64, 192)
            self.time_slice = slice(0,6)

    def __len__(self):
        return self.tmplen // self.batch_size

    def _extract_data_from_hdf5(self, file_list):
        input_list = ['priors', 'model']
        output_list = ['forecast']
        # Preparing input dictionary
        inputs_dict = {}
        for name in input_list:
            new_var = f'input_{name}'
            inputs_dict[new_var] = []
        outputs = []
        for f in file_list:
            with h5py.File(f, 'r') as h5:
                for k in input_list:
                    new_var = f'input_{k}'
                    if self.subsample:
                        inputs_dict[new_var].append(h5.get(k)[:, self.slice, self.slice, :])
                    else:
                        inputs_dict[new_var].append(h5.get(k)[...])
                for k in output_list:
                    if self.subsample:
                        outputs.append(h5.get(k)[0:6, self.slice, self.slice, :])
                    else:
                        outputs.append(h5.get(k)[0:6, self.slice, self.slice, :])
        for k in input_list:
            nv = f'input_{k}'
            inputs_dict[nv] = np.stack(inputs_dict[nv], axis = 0)
        outputs = np.stack(outputs, axis = 0)
        return inputs_dict, outputs

    def __getitem__(self, idx: int):
        if idx >= self.__len__():
            raise StopIteration
        low = idx * self.batch_size
        high = min(low + self.batch_size, self.tmplen)
        inputs, outputs = self._extract_data_from_hdf5(self.file_list[low:high])
        return [inputs, outputs]

    def on_epoch_end(self):
        if self.shuffle:
            random.shuffle(self.file_list) # In-place shuffle
```

This works for my case really nicely. It avoids the memory leak nightmare with which I have been dealing by directly trying to use the `tf.data` API (https://github.com/tensorflow/tensorflow/issues/72014) for multiple inputs from the same file. But the documentation on PyDataset stinks!

Looking inside the [source code](https://github.com/keras-team/keras/blob/v3.5.0/keras/src/trainers/data_adapters/py_dataset_adapter.py), PyDataset has an Adapter class that will make a Tensorflow data generator. Does this automatically get called during `fit()`? Is it best practice to call the data generator directly so I can distribute the dataset via [TF's experimental distribute dataset function](https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy#experimental_distribute_dataset)?

In the source, there is also a `PyDatasetEnqueuer` class. Do I need this? Why is this here? Who is the target audience? Is the expectation of the Enquerer in the PyDataset class also the reason I need to raise a `StopIteration` command in `__getitem__`?

Also digging into source, at [this point](https://github.com/keras-team/keras/blob/v3.5.0/keras/src/trainers/data_adapters/py_dataset_adapter.py), the shuffle is hard-coded to 8. That probably needs to go.

Anyway, I don't have any specific programming questions here, but I would like to know what best practices are, how do I use `PyDataset` in a (Tensorflow) distributed data environment, and so on.
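The traceback above raises `ValueError: The PyDataset has length 0` when no batches are collected, which suggests that `len(train_generator)` was 0. The generator code was not posted, so the cause is an assumption: either the generator found no samples at all, or its `__len__` uses floor division, as the quoted `HDFDataset` does, while the dataset is smaller than one batch. The sketch below shows only the arithmetic in plain Python; the function names are hypothetical and nothing here calls Keras.

```python
import math


def batches_floor(num_samples, batch_size):
    """Batch count with floor division: drops the last partial batch."""
    return num_samples // batch_size


def batches_ceil(num_samples, batch_size):
    """Batch count with ceiling division: keeps the last partial batch."""
    return math.ceil(num_samples / batch_size)


# 20 samples with a batch size of 32: floor division reports zero batches.
print(batches_floor(20, 32))
print(batches_ceil(20, 32))

# With no samples at all, both variants report zero batches.
print(batches_floor(0, 32), batches_ceil(0, 32))
```

Printing `len(train_generator)` before calling `model.fit` is a quick way to check which of these situations applies.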
