TensorFlow Equivalent of PyTorch DataLoaders for Hugging Face Datasets

Hello,

Does anyone know the TensorFlow equivalent to something like this in PyTorch:

# Convert the format of the tokenized train dataset to Tensors

train_with_pytorch = tokenized_train_dataset.with_format("torch")

# Convert the format of the tokenized validation dataset to Tensors

eval_with_pytorch = tokenized_eval_dataset.with_format("torch")

# Create the iterable train dataloader; drop_last=True drops the final incomplete batch

train_dataloader = DataLoader(train_with_pytorch, shuffle=True, drop_last=True, collate_fn=default_data_collator, batch_size=8)

# Create the iterable validation dataloader; drop_last=True drops the final incomplete batch

eval_dataloader = DataLoader(eval_with_pytorch, shuffle=True, drop_last=True, collate_fn=default_data_collator, batch_size=8)

I have been able to successfully build and integrate multiple Hugging Face datasets and dataloaders with PyTorch, but I am currently having difficulty reproducing the same pipeline in TensorFlow.

I greatly appreciate any help!

Hi!

You can use the to_tf_dataset() method, which just got a nice rework. If your elements are all the same length, the built-in collator will handle batching (otherwise you'll need to pass a custom collate_fn). You can just do:

tf_train = dataset.to_tf_dataset(columns=["input"], 
                                 label_cols=["labels"],
                                 batch_size=8,
                                 shuffle=True)

Check out the docs for more details. :slight_smile:
