Transform a tf.data.Dataset to a datasets.Dataset?

Is there a method that can help me transform a tf.data.Dataset into a datasets.Dataset?

Hi! We don’t have a dedicated method for converting a tf.data.Dataset to a datasets.Dataset, only the other way around.

So you have two options:

  • If the dataset is small and fits in RAM, you can convert the TF dataset to a Python dictionary and then call datasets.Dataset.from_dict on it. Another approach that might be easier is to install TensorFlow Datasets and convert the TF dataset to a Pandas DataFrame, on which you can call datasets.Dataset.from_pandas: datasets.Dataset.from_pandas(tfds.as_dataframe(tf_dataset))
  • If the dataset doesn’t fit in RAM, you can create a simple loading script in which you iterate over the dataset and yield its examples in _generate_examples. You can find more info here: Create a dataset loading script.
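To make the first option more concrete, here is a minimal sketch of the dictionary route. It assumes every element of the TF dataset is a flat dict with the same keys, and the helper names rows_to_columns and tf_to_hf_dataset are made up for illustration; they are not part of either library.

```python
def rows_to_columns(examples):
    """Turn an iterable of per-example dicts into one dict of column lists."""
    columns = {}
    for example in examples:
        for name, value in example.items():
            # NumPy arrays and scalars expose .tolist(); plain Python values pass through.
            if hasattr(value, "tolist"):
                value = value.tolist()
            columns.setdefault(name, []).append(value)
    return columns


def tf_to_hf_dataset(tf_dataset):
    """Materialize a small tf.data.Dataset of dict elements as a datasets.Dataset."""
    # Imported here so rows_to_columns stays usable without the datasets library.
    import datasets

    # as_numpy_iterator() yields each element with its tensors converted to NumPy values.
    return datasets.Dataset.from_dict(rows_to_columns(tf_dataset.as_numpy_iterator()))
```

Calling tf_to_hf_dataset(tf_dataset) loads everything into memory, so it only suits the small-dataset case. For data that doesn't fit in RAM, the same per-example loop could instead yield one example at a time from _generate_examples.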

Hello @mariosasko,

any updates on this? I have my own dataset with images and masks, and I am trying to make it work with TFSegformerForSemanticSegmentation. I have created a tf.data.Dataset that generates the images and masks from files (filepaths), but when I try to train the model, I run into this error:

Node: 'tf_segformer_for_semantic_segmentation_2/segformer/transpose'
2 root error(s) found.
  (0) INVALID_ARGUMENT:  transpose expects a vector of size 5. But input(1) is a vector of size 4
	 [[{{node tf_segformer_for_semantic_segmentation_2/segformer/transpose}}]]
  (1) INVALID_ARGUMENT:  transpose expects a vector of size 5. But input(1) is a vector of size 4
	 [[{{node tf_segformer_for_semantic_segmentation_2/segformer/transpose}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_152606]

Where is that fifth dimension coming from? I have something like this:

list_ds = tf.data.Dataset.from_tensor_slices((image_files, mask_files))
ds = list_ds.map(lambda x, y: tf.py_function(parse_function, [x, y], [tf.float32, tf.int32]), num_parallel_calls=1)

and at the end of parse_function, I use:

 encoded_inputs = feature_extractor(image, mask, return_tensors="tf")   
 return encoded_inputs['pixel_values'], encoded_inputs['labels']

So now I am wondering whether I should convert the dataset into a datasets.Dataset for better compatibility. I noticed some tips in here: would it be better to write the data into an .arrow file with the ArrowWriter and then load it with Dataset.from_file()? Or can I make the tf.data.Dataset work?

I can make a dummy forward pass with the model (49 classes) without errors:

outputs = model(batch[0], batch[1])
print(outputs.loss, outputs.logits.shape)

> tf.Tensor([nan], shape=(1,), dtype=float32) (1, 49, 128, 128)

Although the loss is nan.

Many thanks!

Perhaps @Rocketknight1 can help
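On the question of where the fifth dimension comes from, one possible explanation (an assumption, not confirmed anywhere in this thread): with return_tensors="tf", the feature extractor returns tensors that already carry a leading batch axis of size 1, so each dataset element is 4-D, and batching the dataset afterwards adds a second batch axis. The sketch below only reproduces the shape arithmetic with NumPy; the 8x8 image size and the batch size of 2 are arbitrary stand-ins.

```python
import numpy as np

# Stand-in for encoded_inputs["pixel_values"] for ONE image:
# the leading 1 is a batch axis added by return_tensors="tf".
per_example = np.zeros((1, 3, 8, 8), dtype=np.float32)

# Batching two such elements stacks them along a new leading axis.
batched = np.stack([per_example, per_example])
print(batched.shape)  # (2, 1, 3, 8, 8)

# Dropping the extra axis per example before batching leaves 4 dimensions.
fixed = np.stack([per_example.squeeze(axis=0), per_example.squeeze(axis=0)])
print(fixed.shape)  # (2, 3, 8, 8)
```

If this is what is happening, calling tf.squeeze(..., axis=0) on both returned tensors at the end of parse_function might be enough to make the batched tf.data.Dataset work.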