Export own dataset with different feature types to TFRecord

dball · September 28, 2022, 12:34pm

I have a local dataset that contains features of type str, list(int), and list(str). I want to export the dataset to a TFRecord file and tried:

dataset.set_format("numpy", dtype=int)
dataset.export(f"{data_name}.tfrecord")

But that gives me the error message

  File "...datasets/formatting/formatting.py", line 202, in _arrow_array_to_numpy
    return np.array(array, copy=False, **self.np_array_kwargs)

ValueError: invalid literal for int() with base 10: 'foobarstring'

Without the argument dtype=int for dataset.set_format(), I get the error message:

array([856, 965, 911, 973], dtype=int32)] is an np.ndarray with items of dtype int32, which cannot be serialized

due to this type check

I tried with datasets version 2.3.0 and with version 2.5.1.

Is it possible to export my dataset as TFRecord? How?

mariosasko · September 28, 2022, 2:27pm

Hi! Try casting the (list(int)) column of type Sequence(Value("int32")) to Sequence(Value("int64")) to make sure they are formatted as np.int64 arrays before the export.

dball · September 28, 2022, 5:31pm

Thanks for the fast reply, @mariosasko

Calling dataset = dataset.cast_column("bboxes", Sequence(feature=Sequence(feature=Value("int64")))) before dataset.set_format("numpy") leads to formatting as int64:

dataset.features
{'bboxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), ...}

But calling dataset.export(f"{data_name}.tfrecord") thereafter still leads to a very similar error message: is an np.ndarray with items of dtype int64, which cannot be serialized.

I do not understand why dtype int64 does not take the branch

elif values.dtype == np.dtype(int):
                    return _int64_feature(values)

in _feature() in file arrow_dataset.py.
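One possible explanation, not confirmed in this thread: the error message talks about the dtype of the array's items, which hints that the outer array for a nested Sequence column may be an object array whose elements are int64 arrays. A dtype check on the outer array would then fail even though every item is int64. The sketch below only illustrates that difference with plain NumPy; the construction of `values` is a hypothetical stand-in for what the numpy formatter returns, not taken from the datasets source.

```python
import numpy as np

# Hypothetical stand-in for a nested Sequence column after set_format("numpy"):
# an outer object array that holds one int64 array per bounding box.
inner = np.array([856, 965, 911, 973], dtype=np.int64)
values = np.empty(1, dtype=object)
values[0] = inner

# The container's dtype is object, so a check against int64 on it fails ...
outer_is_int64 = values.dtype == np.int64
# ... while the first item does have dtype int64.
item_is_int64 = values[0].dtype == np.int64

print(values.dtype, values[0].dtype)
```

If this is what happens, casting the inner values to int64 cannot help by itself, because the check looks at the container rather than at its items.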

In pdb I only get:

> /home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/tensorflow/python/ops/gen_experimental_dataset_ops.py(1766)dataset_to_tf_record()
-> _result = pywrap_tfe.TFE_Py_FastPathExecute(
(Pdb) l
1761 	  """
1762 	  _ctx = _context._context or _context.context()
1763 	  tld = _ctx._thread_local_data
1764 	  if tld.is_eager:
1765 	    try:
1766 ->	      _result = pywrap_tfe.TFE_Py_FastPathExecute(
1767 	        _ctx, "DatasetToTFRecord", name, input_dataset, filename,
1768 	        compression_type)
1769 	      return _result
1770 	    except _core._NotOkStatusException as e:
1771 	      _ops.raise_from_not_ok_status(e, name)
(Pdb) s
2022-09-28 19:25:20.171984: W tensorflow/core/framework/op_kernel.cc:1768] INVALID_ARGUMENT: ValueError: values=[...] is an np.ndarray with items of dtype int64, which cannot be serialized

File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
    ret = func(*args)

  File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)

  File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1035, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3910, in generator
    yield serialize_example(ex)

  File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3900, in serialize_example
    feature = {key: _feature(value) for key, value in ex.items()}

  File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3900, in <dictcomp>
    feature = {key: _feature(value) for key, value in ex.items()}

  File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3884, in _feature
    raise ValueError(...

dball · September 29, 2022, 3:07pm

Backreference and motivation: Perform data validation (e.g. with Tensorflow Data Validation) on a 🤗 Dataset for NER

mariosasko · September 30, 2022, 1:04pm

Hi again! I did a quick test on a dummy dataset with the same format as yours, and everything works as expected (in datasets 2.5.1). Could you please install the newest version of datasets and perform the cast before the export, as I suggested, and let us know if that helps?

dball · October 30, 2022, 9:33am

Rehi @mariosasko. I did try datasets version 2.3.0 as well as version 2.5.1 above. Now I am using version 2.6.1.

I still get errors, but different ones. Hence I created a minimal example:

from datasets import ClassLabel, Dataset, Features
from datasets.features.features import Sequence, Value

def test_export_tfrecords():
    invoice = Dataset.from_dict({
        'id': ['0'],
        'words': [['14.99']],
        'bboxes': [[[11, 1, 21, 10]]],
        'ner_tags': [[0]],
        'image_path': ['test.jpg']},
        Features({
            'id': Value(dtype='string', id=None),
            'words': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
            'bboxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
            'ner_tags': Sequence(feature=ClassLabel(num_classes=2, names=['B-TOTAL', 'O'], names_file=None, id=None), length=-1, id=None),
            'image_path': Value(dtype='string', id=None)
        })
    )
    invoice.set_format("numpy")
    invoice.export(f"minimal_invoice.tfrecord")

The output I get:

2022-10-30 10:24:21.155765: W tensorflow/core/framework/op_kernel.cc:1768] INVALID_ARGUMENT: TypeError: only integer scalar arrays can be converted to a scalar index
Traceback (most recent call last):

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
    ret = func(*args)

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1035, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4040, in generator
    yield serialize_example(ex)

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4030, in serialize_example
    feature = {key: _feature(value) for key, value in ex.items()}

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4030, in <dictcomp>
    feature = {key: _feature(value) for key, value in ex.items()}

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4008, in _feature
    return _int64_feature(values)

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3995, in _int64_feature
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

TypeError: only integer scalar arrays can be converted to a scalar index


2022-10-30 10:24:21.156561: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at to_tf_record_op.cc:55 : INVALID_ARGUMENT: TypeError: only integer scalar arrays can be converted to a scalar index
Traceback (most recent call last):

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
    ret = func(*args)

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1035, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4040, in generator
    yield serialize_example(ex)

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4030, in serialize_example
    feature = {key: _feature(value) for key, value in ex.items()}

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4030, in <dictcomp>
    feature = {key: _feature(value) for key, value in ex.items()}

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4008, in _feature
    return _int64_feature(values)

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3995, in _int64_feature
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

TypeError: only integer scalar arrays can be converted to a scalar index

	 [[{{node PyFunc}}]]

I assume the Features already define the correct types. Nonetheless, I also tried calling invoice = invoice.cast_column("bboxes", Sequence(feature=Sequence(feature=Value("int64")))) before invoice.set_format("numpy"), but invoice.export(f"minimal_invoice.tfrecord") still yields the same error.

However, if I comment out the two lines containing “bboxes”, the test case passes successfully.
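Since the export works once the nested 'bboxes' column is removed, one workaround worth trying is to flatten that column to a single list of integers before exporting and to keep the shape so it can be rebuilt later. This is only a sketch under that assumption: `flatten_boxes` and `restore_boxes` are hypothetical helper names, and applying them to the dataset (for example through `map`) and the subsequent export are not tested here.

```python
import numpy as np

def flatten_boxes(boxes):
    """Flatten an (n, 4) nested box list into a 1-D int64 array plus its shape."""
    arr = np.asarray(boxes, dtype=np.int64)
    return arr.reshape(-1), arr.shape

def restore_boxes(flat, shape):
    """Rebuild the original array from the flat values and the stored shape."""
    return np.asarray(flat, dtype=np.int64).reshape(shape)

# The single bounding box from the minimal example above.
flat, shape = flatten_boxes([[11, 1, 21, 10]])
restored = restore_boxes(flat, shape)

print(flat.tolist(), shape)
```

The flat values would go into one integer column and the shape into a second one, so that a reader of the TFRecord file can reshape the values back into boxes.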

sambutle · April 17, 2023, 6:39am

This error message suggests that a function or operation is expecting an integer scalar index (single integer) as an index, but is instead receiving an array or a non-integer value. This can occur when attempting to index or slice an array with a non-integer or non-scalar value. To resolve this error, ensure that the index being used is a single integer and not an array or non-integer value. If working with arrays, consider using integer indexing or slicing methods to access specific elements or subsets of the array.
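A small NumPy-only sketch of that behaviour: converting an array with more than zero dimensions to a single Python integer index raises a TypeError, while a 0-d integer array converts fine. Whether the TFRecord export reaches exactly this conversion internally is an assumption; the `box` array merely mimics one row of the 'bboxes' column from the minimal example.

```python
import operator

import numpy as np

# 2-D integer array, shaped like one entry of the 'bboxes' column.
box = np.array([[11, 1, 21, 10]], dtype=np.int64)

try:
    operator.index(box)  # asks NumPy for one Python int
    raised = False
except TypeError as err:
    raised = True
    print(err)

# A 0-d integer array can be converted to a scalar index.
scalar = operator.index(np.array(7))
```

This suggests the problem lies in the shape of the value handed over for serialization, not in its integer type.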
