Hi again @mariosasko. I tried datasets version 2.3.0 as well as version 2.5.1 mentioned above. I am now using version 2.6.1.
I still get errors, but different ones, so I created a minimal example:
from datasets import ClassLabel, Dataset, Features
from datasets.features.features import Sequence, Value


def test_export_tfrecords():
    invoice = Dataset.from_dict({
            'id': ['0'],
            'words': [['14.99']],
            'bboxes': [[[11, 1, 21, 10]]],
            'ner_tags': [[0]],
            'image_path': ['test.jpg']},
        Features({
            'id': Value(dtype='string', id=None),
            'words': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
            'bboxes': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
            'ner_tags': Sequence(feature=ClassLabel(num_classes=2, names=['B-TOTAL', 'O'], names_file=None, id=None), length=-1, id=None),
            'image_path': Value(dtype='string', id=None)
        })
    )
    invoice.set_format("numpy")
    invoice.export(f"minimal_invoice.tfrecord")
The output I get:
2022-10-30 10:24:21.155765: W tensorflow/core/framework/op_kernel.cc:1768] INVALID_ARGUMENT: TypeError: only integer scalar arrays can be converted to a scalar index
Traceback (most recent call last):
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
ret = func(*args)
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
return func(*args, **kwargs)
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1035, in generator_py_func
values = next(generator_state.get_iterator(iterator_id))
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4040, in generator
yield serialize_example(ex)
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4030, in serialize_example
feature = {key: _feature(value) for key, value in ex.items()}
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4030, in <dictcomp>
feature = {key: _feature(value) for key, value in ex.items()}
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4008, in _feature
return _int64_feature(values)
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3995, in _int64_feature
return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
TypeError: only integer scalar arrays can be converted to a scalar index
2022-10-30 10:24:21.156561: W tensorflow/core/framework/op_kernel.cc:1780] OP_REQUIRES failed at to_tf_record_op.cc:55 : INVALID_ARGUMENT: TypeError: only integer scalar arrays can be converted to a scalar index
Traceback (most recent call last):
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
ret = func(*args)
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
return func(*args, **kwargs)
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1035, in generator_py_func
values = next(generator_state.get_iterator(iterator_id))
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4040, in generator
yield serialize_example(ex)
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4030, in serialize_example
feature = {key: _feature(value) for key, value in ex.items()}
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4030, in <dictcomp>
feature = {key: _feature(value) for key, value in ex.items()}
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4008, in _feature
return _int64_feature(values)
File "/home/davef/anaconda3/envs/data310/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3995, in _int64_feature
return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
TypeError: only integer scalar arrays can be converted to a scalar index
[[{{node PyFunc}}]]
I guess the Features already define the correct types. Nonetheless, I also tried
invoice = invoice.cast_column("bboxes", Sequence(feature=Sequence(feature=Value("int64"))))
before calling invoice.set_format("numpy"), but invoice.export(f"minimal_invoice.tfrecord") still yields the same error.
However, if I comment out the two lines containing “bboxes”, the test case passes successfully.
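Since only the nested bboxes column triggers the failure, and a tf.train.Int64List holds a flat list of integers, one possible workaround is to flatten each example's boxes before exporting and restore the shape after reading. The sketch below is only an illustration under that assumption: the helper names flatten_bboxes and unflatten_bboxes are hypothetical and not part of datasets, and I have not verified that this avoids the error in 2.6.1.

```python
from typing import List, Tuple


def flatten_bboxes(bboxes: List[List[int]]) -> Tuple[List[int], int]:
    """Flatten a list of boxes into one flat list of ints.

    Hypothetical helper: returns the flat coordinates plus the number of
    boxes, so the original nesting can be rebuilt after reading.
    """
    flat = [int(coord) for box in bboxes for coord in box]
    return flat, len(bboxes)


def unflatten_bboxes(flat: List[int], box_size: int = 4) -> List[List[int]]:
    """Rebuild the nested boxes from a flat list (assumes fixed-size boxes)."""
    if len(flat) % box_size != 0:
        raise ValueError("flat length is not a multiple of box_size")
    return [flat[i:i + box_size] for i in range(0, len(flat), box_size)]


# Same box as in the minimal example above.
flat, num_boxes = flatten_bboxes([[11, 1, 21, 10]])
restored = unflatten_bboxes(flat)
```

The flattening could be applied per example before set_format("numpy") and export, with the box count stored in an extra column if the number of boxes varies.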