I want to ask if there is a method that can help me transform a tf.data.Dataset into a datasets.Dataset.
Hi! We don’t have a dedicated method for converting a tf.data.Dataset to a datasets.Dataset, only the other way around.
So you have two options:
- If the dataset is small and fits in RAM, you can convert the TF dataset to a Python dictionary and then call datasets.Dataset.from_dict on it. Another approach that might be easier is to install TensorFlow Datasets and convert the TF dataset to a Pandas DataFrame, on which you can call datasets.Dataset.from_pandas (sketched below):
datasets.Dataset.from_pandas(tfds.as_dataframe(tf_dataset))
- If the dataset doesn’t fit in RAM, you can create a simple loading script in which you iterate over the dataset and yield its examples in _generate_examples (also sketched below). You can find more info here: Create a dataset loading script.
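For reference, here are minimal sketches of both options; the toy dataset and the "x"/"y" column names are placeholders of mine, not something from this thread:

import tensorflow as tf
import tensorflow_datasets as tfds
from datasets import Dataset

# Hypothetical small dataset with placeholder columns "x" and "y"
tf_dataset = tf.data.Dataset.from_tensor_slices({"x": [1.0, 2.0, 3.0], "y": [0, 1, 0]})

# Option 1a: materialize the whole dataset into a Python dict (only if it fits in RAM)
columns = {"x": [], "y": []}
for example in tf_dataset.as_numpy_iterator():
    for name, value in example.items():
        columns[name].append(value)
hf_dataset = Dataset.from_dict(columns)

# Option 1b: go through a Pandas DataFrame via Tensorflow Datasets
hf_dataset = Dataset.from_pandas(tfds.as_dataframe(tf_dataset))

# Option 2: in a loading script, a GeneratorBasedBuilder subclass streams the examples
# instead of holding them all in memory
def _generate_examples(self):
    for idx, example in enumerate(tf_dataset.as_numpy_iterator()):
        yield idx, {"x": float(example["x"]), "y": int(example["y"])}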
Hello @mariosasko,
any updates on this? I have my own dataset with images and masks, and I am trying to make it work with the TFSegformerForSemanticSegmentation model. I have created a tf.data.Dataset that generates the images and masks from files (file paths), but when I try to train the model, I run into this error:
Node: 'tf_segformer_for_semantic_segmentation_2/segformer/transpose'
2 root error(s) found.
(0) INVALID_ARGUMENT: transpose expects a vector of size 5. But input(1) is a vector of size 4
[[{{node tf_segformer_for_semantic_segmentation_2/segformer/transpose}}]]
[[tf_segformer_for_semantic_segmentation_2/sparse_categorical_crossentropy/cond/then/_0/tf_segformer_for_semantic_segmentation_2/sparse_categorical_crossentropy/cond/cond/then/_44/tf_segformer_for_semantic_segmentation_2/sparse_categorical_crossentropy/cond/cond/remove_squeezable_dimensions/cond/pivot_t/_102/_2229]]
(1) INVALID_ARGUMENT: transpose expects a vector of size 5. But input(1) is a vector of size 4
[[{{node tf_segformer_for_semantic_segmentation_2/segformer/transpose}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_152606]
Where is that fifth dimension coming from? I have something like this:
list_ds = tf.data.Dataset.from_tensor_slices((image_files, mask_files))
ds = list_ds.map(lambda x, y: tf.py_function(parse_function, [x, y], [tf.float32, tf.int32]), num_parallel_calls=1)
and at the end of parse_function, I use:
encoded_inputs = feature_extractor(image, mask, return_tensors="tf")
return encoded_inputs['pixel_values'], encoded_inputs['labels']
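For reference, a simplified sketch of a parse_function along these lines, with the extra batch axis from return_tensors="tf" squeezed out; the decoding code and the [0] indexing are assumptions, not the original implementation (feature extractors called with return_tensors="tf" return tensors with a leading batch axis of size 1, which can become a fifth dimension once the dataset is batched again):

def parse_function(image_path, mask_path):
    # Hypothetical loading code; the real implementation is not shown in the post
    image = tf.io.decode_image(tf.io.read_file(image_path), channels=3)
    mask = tf.io.decode_image(tf.io.read_file(mask_path), channels=1)
    encoded_inputs = feature_extractor(image.numpy(), mask.numpy()[..., 0], return_tensors="tf")
    # Assumed fix attempt: drop the leading batch axis of size 1 so that batching the
    # tf.data.Dataset later produces rank-4 pixel_values instead of rank-5
    pixel_values = encoded_inputs["pixel_values"][0]
    labels = tf.cast(encoded_inputs["labels"][0], tf.int32)  # match the Tout declared in map()
    return pixel_values, labels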
So now I am wondering whether I should convert the dataset into a datasets.Dataset for better compatibility. I noticed some tips in here: would it be better to write the data into an .arrow file with the ArrowWriter and then load it with Dataset.from_file()? Or can I make the tf.data.Dataset work?
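In case that route is the better one, here is a rough sketch of the ArrowWriter approach mentioned above; the Features definition, the shapes and the file name are assumptions, not a verified recipe:

from datasets import Dataset, Features, Array2D, Array3D
from datasets.arrow_writer import ArrowWriter

# Placeholder feature spec; shapes/dtypes must match what the feature extractor returns
features = Features({
    "pixel_values": Array3D(shape=(3, 512, 512), dtype="float32"),
    "labels": Array2D(shape=(512, 512), dtype="int32"),
})

writer = ArrowWriter(features=features, path="segmentation_data.arrow")
for image_file, mask_file in zip(image_files, mask_files):
    # assuming parse_function returns per-example tensors without a batch axis
    pixel_values, labels = parse_function(image_file, mask_file)
    writer.write({"pixel_values": pixel_values.numpy(), "labels": labels.numpy()})
writer.finalize()

ds = Dataset.from_file("segmentation_data.arrow")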
I can make a dummy forward pass with the model (49 classes) without errors:
outputs = model(batch[0], batch[1])
print(outputs.loss, outputs.logits.shape)
> tf.Tensor([nan], shape=(1,), dtype=float32) (1, 49, 128, 128)
Although the loss is nan.
Many thanks!
eppane
Perhaps @Rocketknight1 can help