TensorFlow Hugging Face Datasets Equivalent to PyTorch

Hello,

Does anyone know the TensorFlow equivalent to something like this in PyTorch:

from torch.utils.data import DataLoader
from transformers import default_data_collator

# Convert the format of the tokenized train dataset to PyTorch tensors
train_with_pytorch = tokenized_train_dataset.with_format("torch")

# Convert the format of the tokenized validation dataset to PyTorch tensors
eval_with_pytorch = tokenized_eval_dataset.with_format("torch")

# Create the train dataloader; drop_last=True drops the final incomplete batch
train_dataloader = DataLoader(train_with_pytorch, shuffle=True, drop_last=True,
                              collate_fn=default_data_collator, batch_size=8)

# Create the validation dataloader; drop_last=True drops the final incomplete batch
eval_dataloader = DataLoader(eval_with_pytorch, shuffle=True, drop_last=True,
                             collate_fn=default_data_collator, batch_size=8)
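(As a side note, `drop_last=True` only drops the final incomplete batch; it does not filter out individual sequences shorter than 2048 tokens, which would need a separate `Dataset.filter` step. A minimal plain-Python sketch of the batching semantics, using a hypothetical `batched` helper that is not part of either library:)

```python
def batched(items, batch_size, drop_last=False):
    """Group items into fixed-size batches; optionally drop the final
    incomplete batch (this is what DataLoader's drop_last controls)."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

samples = list(range(20))                          # 20 examples, batch_size 8
print(len(batched(samples, 8)))                    # 3 batches (last has only 4 items)
print(len(batched(samples, 8, drop_last=True)))    # 2 full batches
```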

I have successfully built and integrated multiple Hugging Face datasets and dataloaders with PyTorch, but I am currently having difficulty reproducing the same setup in TensorFlow.

I greatly appreciate any help!

Hi!

You can use the to_tf_dataset() method, which just got a nice rework. If your elements are all the same length, the built-in collator will handle batching; otherwise you'll need a custom collator. You can just do:

tf_train = dataset.to_tf_dataset(columns=["input"], 
                                 label_cols=["labels"],
                                 batch_size=8,
                                 shuffle=True)

Check out the docs here for more details :slight_smile:


Hi @stevhliu ,

I appreciate the response.

I should have been more explicit. I had already reviewed the documentation, but it was not clear to me whether everything available in tf.data.Dataset can be used interchangeably with Hugging Face's to_tf_dataset() or not.

For example, to drop the last batch in TensorFlow:

dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3, drop_remainder=True)
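(For reference, that snippet keeps only full batches of 3; the same arithmetic in plain Python, no TensorFlow required:)

```python
# Replicate tf.data.Dataset.range(8).batch(3, drop_remainder=True)
# with a list comprehension: only groups of exactly 3 items survive.
data = list(range(8))
full_batches = [data[i:i + 3] for i in range(0, len(data), 3)
                if len(data[i:i + 3]) == 3]
print(full_batches)  # [[0, 1, 2], [3, 4, 5]] -- the trailing [6, 7] is dropped
```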

Would this be the same thing with Hugging Face's to_tf_dataset()?

tf_train = dataset.to_tf_dataset(columns=["input"],
                                 label_cols=["labels"],
                                 batch_size=8,
                                 shuffle=True,
                                 drop_remainder=True)

Thank you,

Enrico