HF Dataset + TensorFlow + Ragged Tensors (Object Detection)

Hi, I’m trying to use Datasets to load my data into a tf.data.Dataset for use with Keras 3. This seems impossible because the bounding boxes and class labels are ragged tensors. I’m getting the following error:

RuntimeError: Unrecognized array dtype object. 
Nested types and image/audio types are not supported yet.

This is my collate_fn:

    import tensorflow as tf

    def collate_fn(examples):
        print(examples)  # debug output
        images, boxes, classes = [], [], []
        for example in examples:
            images.append(tf.convert_to_tensor(example["image"], dtype=tf.float32))
            # One row of 4 coordinates per box; the number of boxes varies per image.
            boxes.append(tf.reshape(tf.convert_to_tensor(example["boxes"], dtype=tf.float32), (-1, 4)))
            # Flat vector with one class id per box.
            classes.append(tf.reshape(tf.convert_to_tensor(example["classes"], dtype=tf.float32), (-1,)))

        # Images are stacked densely (they must share a shape); boxes and classes are ragged.
        return {"image": tf.stack(images), "boxes": tf.ragged.stack(boxes), "classes": tf.ragged.stack(classes)}

Thanks in advance!
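
For readers wondering why the collate_fn above uses tf.ragged.stack for the labels, here is a minimal sketch. The two box tensors are made-up toy values, not data from the thread. It shows that tf.stack rejects inputs whose box counts differ, while tf.ragged.stack accepts them.

```python
import tensorflow as tf

# Hypothetical boxes for two images: the first has 2 boxes, the second has 1.
boxes_a = tf.constant([[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 2.0, 2.0]])
boxes_b = tf.constant([[0.0, 0.0, 3.0, 3.0]])

# tf.stack needs identical shapes, so it fails on (2, 4) vs (1, 4).
try:
    tf.stack([boxes_a, boxes_b])
    dense_ok = True
except (tf.errors.InvalidArgumentError, ValueError):
    dense_ok = False

# tf.ragged.stack accepts differing sizes and returns a tf.RaggedTensor.
ragged_boxes = tf.ragged.stack([boxes_a, boxes_b])
boxes_per_image = ragged_boxes.row_lengths().numpy().tolist()
print(dense_ok, boxes_per_image)
```

The row lengths of the ragged result give the number of boxes per image, which is the information a dense tensor cannot carry without padding.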

A library version mismatch seems to be the most common cause. Versions that are too new and versions that are too old both seem to cause trouble.

Thanks! I’ll try this when I’m back at work.

But I’d like to note I’m using Keras 3 with TensorFlow, not Transformers.

With tf.keras, it looks like there have been some changes.

Note that you can also do ds = ds.with_format("tf") to get TF tensors / ragged tensors automatically.

@lhoestq I know about that one, but my understanding is that it puts the tensors in memory (documentation), which means that my computer will blow up :sweat_smile: It’s a big dataset.

@John6666 I’m not using tf.keras, I’m using keras. It’s the “new” thing where you can use TensorFlow, PyTorch or JAX. There’s currently only support for TensorFlow with the Yolo model, which is a shame - I feel tricked :sweat_smile:

Oh, sorry. :sweat_smile:

I suggest opening a Discussion somewhere on this, because with this many users, someone will know how to solve it. When a Discussion is opened in one of the community repos, the members will be notified, so they will generally be aware of it.

If you want to solve it yourself: the error message is a very rare one, and it should only come from this line of code, so I thought that would give you a clue.

to_tf_dataset => dataset._get_output_signature => This error
It looks like it’s only called through this route, but where is the to_tf_dataset called…?

It doesn’t load the dataset in memory. Rather, it sets the output format of the dataset to TF tensors, so when you access my_dataset[0] the output is automatically formatted as a TF tensor (and in an optimized way from the underlying Arrow data).

I’m calling to_tf_dataset myself; I read these docs: Using Datasets with TensorFlow

Furthermore, I run into other problems when using with_format("tensorflow").

@lhoestq My dataset is not fully loaded into RAM but read as we grab batches (get_item). Are you sure about this? I’ll do some testing. If you are right, does that mean this documentation is wrong? Using Datasets with TensorFlow

I think this doc is just a bit confusing. In particular, it mixes “formatting as TF” and “converting to TF”, which are not the same thing:

  • format as TF in datasets: calling with_format("tf") doesn’t load the data in RAM; it only sets the output type of the Dataset to TF tensors (the data still lives on disk and is memory mapped)
  • convert to TF in tf.data: this loads the full data in memory, e.g. when using tf.data.Dataset.from_tensor_slices()

Would be great to rephrase it a bit to make it clearer though, the docs can be modified here: datasets/docs/source/use_with_tensorflow.mdx at main · huggingface/datasets · GitHub

Thanks @lhoestq that clarifies a lot. I’ll close this topic now :slight_smile:
