Export own dataset with different feature types to TFRecord

I have a local dataset that contains features of type str, list(int), and list(str). I want to export the dataset as a TFRecord file and tried:

dataset.set_format("numpy", dtype=int)
dataset.export(f"{data_name}.tfrecord")

But that gives me the error message

  File "...datasets/formatting/formatting.py", line 202, in _arrow_array_to_numpy
    return np.array(array, copy=False, **self.np_array_kwargs)

ValueError: invalid literal for int() with base 10: 'foobarstring'
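
For context, the traceback suggests that the dtype passed to set_format() is forwarded to np.array() for every column, so it also reaches the string columns. A minimal NumPy-only sketch of the same failure, using a made-up column value rather than the real dataset:

```python
import numpy as np

# Hypothetical string column, standing in for one of the str features.
string_column = ["foobarstring"]

try:
    # Same kind of conversion as in the traceback: strings forced to int.
    np.array(string_column, dtype=int)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

If this is what happens, a single dtype=int cannot work for a dataset that mixes string and integer features.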

Without the argument dtype=int for dataset.set_format(), I get the error message:

array([856, 965, 911, 973], dtype=int32)] is an np.ndarray with items of dtype int32, which cannot be serialized

which is raised by the type check in _feature() in arrow_dataset.py.

I tried with datasets version 2.3.0 and with version 2.5.1.

Is it possible to export my dataset as TFRecord? How?

Hi! Try casting the list(int) column from Sequence(Value("int32")) to Sequence(Value("int64")) so that it is formatted as np.int64 arrays before the export.

Thanks for the fast reply, @mariosasko 🙂

Calling dataset = dataset.cast_column("bboxes", Sequence(feature=Sequence(feature=Value("int64")))) before dataset.set_format("numpy") leads to formatting as int64:

dataset.features
{'bboxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None), ...}

But calling dataset.export(f"{data_name}.tfrecord") afterwards still leads to a very similar error message: "... is an np.ndarray with items of dtype int64, which cannot be serialized".

I do not understand why dtype int64 does not take the branch

elif values.dtype == np.dtype(int):
    return _int64_feature(values)

in _feature() in file arrow_dataset.py.
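
A possible explanation, not verified against the datasets source: because bboxes is a list of lists, the NumPy formatter may hand _feature() an outer array of dtype object whose items are int64 arrays. The elif would then compare the outer dtype, while the error message reports the dtype of the items. The sketch below builds such an array by hand; the first box is taken from the earlier error message and the second is made up.

```python
import numpy as np

# Hand-built stand-in for one "bboxes" cell: an object array holding int64 arrays.
values = np.empty(2, dtype=object)
values[0] = np.array([856, 965, 911, 973], dtype=np.int64)
values[1] = np.array([10, 20, 30, 40], dtype=np.int64)  # made-up second box

outer_is_int = values.dtype == np.dtype(int)  # what the elif branch compares
item_dtype = values[0].dtype                  # what the error message reports

print(outer_is_int, item_dtype)
```

Under that assumption the outer comparison is False even though every item is int64, which would match the message "items of dtype int64, which cannot be serialized".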

In pdb I only get:

> /home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/tensorflow/python/ops/gen_experimental_dataset_ops.py(1766)dataset_to_tf_record()
-> _result = pywrap_tfe.TFE_Py_FastPathExecute(
(Pdb) l
1761 	  """
1762 	  _ctx = _context._context or _context.context()
1763 	  tld = _ctx._thread_local_data
1764 	  if tld.is_eager:
1765 	    try:
1766 ->	      _result = pywrap_tfe.TFE_Py_FastPathExecute(
1767 	        _ctx, "DatasetToTFRecord", name, input_dataset, filename,
1768 	        compression_type)
1769 	      return _result
1770 	    except _core._NotOkStatusException as e:
1771 	      _ops.raise_from_not_ok_status(e, name)
(Pdb) s
2022-09-28 19:25:20.171984: W tensorflow/core/framework/op_kernel.cc:1768] INVALID_ARGUMENT: ValueError: values=[...] is an np.ndarray with items of dtype int64, which cannot be serialized

File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
    ret = func(*args)

  File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)

  File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1035, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3910, in generator
    yield serialize_example(ex)

  File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3900, in serialize_example
    feature = {key: _feature(value) for key, value in ex.items()}

  File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3900, in <dictcomp>
    feature = {key: _feature(value) for key, value in ex.items()}

  File "/home/davef/anaconda3/envs/data309/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3884, in _feature
    raise ValueError(...

Backreference and motivation: Perform data validation (e.g. with Tensorflow Data Validation) on a 🤗 Dataset for NER

Hi again! I did a quick test on a dummy dataset with the same format as yours, and everything works as expected (in datasets 2.5.1). Could you please install the newest version of datasets and perform the cast before the export, as I suggested, and let us know if that helps?

Rehi @mariosasko. I did try datasets version 2.3.0 as well as version 2.5.1 above. Now I am using version 2.6.1.

I still get errors, but different ones. Hence I created a minimal example:

from datasets import ClassLabel, Dataset, Features
from datasets.features.features import Sequence, Value

def test_export_tfrecords():
    invoice = Dataset.from_dict({
        'id': ['0'],
        'words': [['14.99']],
        'bboxes': [[[11, 1, 21, 10]]],
        'ner_tags': [[0]],
        'image_path': ['test.jpg']},
        Features({
            'id': Value(dtype='string', id=None),
            'words': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
            'bboxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
            'ner_tags': Sequence(feature=ClassLabel(num_classes=2, names=['B-TOTAL', 'O'], names_file=None, id=None), length=-1, id=None),
            'image_path': Value(dtype='string', id=None)
        })
    )
    invoice.set_format("numpy")
    invoice.export(f"minimal_invoice.tfrecord")

The output I get:

2022-10-30 10:24:21.155765: W tensorflow/core/framework/op_kernel.cc:1768] INVALID_ARGUMENT: TypeError: only integer scalar arrays can be converted to a scalar index
Traceback (most recent call last):

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
    ret = func(*args)

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
    return func(*args, **kwargs)

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1035, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4040, in generator
    yield serialize_example(ex)

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4030, in serialize_example
    feature = {key: _feature(value) for key, value in ex.items()}

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4030, in <dictcomp>
    feature = {key: _feature(value) for key, value in ex.items()}

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4008, in _feature
    return _int64_feature(values)

  File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3995, in _int64_feature
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))

TypeError: only integer scalar arrays can be converted to a scalar index


2022-10-30 10:24:21.156561: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at to_tf_record_op.cc:55 : INVALID_ARGUMENT: TypeError: only integer scalar arrays can be converted to a scalar index

	 [[{{node PyFunc}}]]

I guess the Features already define the correct types. Nonetheless, I also tried invoice = invoice.cast_column("bboxes", Sequence(feature=Sequence(feature=Value("int64")))) before calling invoice.set_format("numpy"), but invoice.export(f"minimal_invoice.tfrecord") still yields the same error.

However, if I comment out the two lines containing “bboxes”, the test case passes.
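
Since only the nested bboxes column fails, one workaround worth trying is to flatten it before the export: a tf.train.Int64List holds a flat sequence of integers, so a list of boxes has no direct representation there. The sketch below is plain Python. The helper names flatten_bboxes and unflatten_bboxes and the column name bboxes_flat are hypothetical, and it assumes every box has exactly four coordinates so that a reader can rebuild the boxes.

```python
def flatten_bboxes(example):
    """Return a flat list of coordinates for one example (hypothetical helper)."""
    flat = [coord for box in example["bboxes"] for coord in box]
    return {"bboxes_flat": flat}


def unflatten_bboxes(flat, box_size=4):
    """Rebuild the list of boxes, assuming a fixed number of coordinates per box."""
    return [flat[i:i + box_size] for i in range(0, len(flat), box_size)]


# Made-up example row with two boxes.
example = {"bboxes": [[11, 1, 21, 10], [5, 6, 7, 8]]}
flat = flatten_bboxes(example)["bboxes_flat"]
restored = unflatten_bboxes(flat)

print(flat)
print(restored)
```

With a 🤗 Dataset this could be applied as invoice = invoice.map(flatten_bboxes, remove_columns=["bboxes"]) before invoice.set_format("numpy"). Whether the export then succeeds on datasets 2.6.1 is untested here.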