TensorFlow Equivalent of PyTorch DataLoaders for Hugging Face Datasets

Hello,

Does anyone know the TensorFlow equivalent to something like this in PyTorch:

# Convert the format of the tokenized train dataset to Tensors

train_with_pytorch = tokenized_train_dataset.with_format("torch")

# Convert the format of the tokenized validation dataset to Tensors

eval_with_pytorch = tokenized_eval_dataset.with_format("torch")

# Create the iterable train dataloader; drop_last=True drops the final incomplete batch

train_dataloader = DataLoader(train_with_pytorch, shuffle=True, drop_last=True, collate_fn=default_data_collator, batch_size=8)

# Create the iterable validation dataloader; drop_last=True drops the final incomplete batch

eval_dataloader = DataLoader(eval_with_pytorch, shuffle=True, drop_last=True, collate_fn=default_data_collator, batch_size=8)

I have been able to successfully build and integrate multiple Hugging Face datasets and dataloaders with PyTorch, but I am currently having difficulty reproducing the same pipeline in TensorFlow.

I greatly appreciate any help!

Hi!

You can use the to_tf_dataset() method, which just got a nice rework. If your elements are all the same length, the built-in collator will handle batching (otherwise you'll need to pass a custom collate_fn). You can just do:

tf_train = dataset.to_tf_dataset(columns=["input"], 
                                 label_cols=["labels"],
                                 batch_size=8,
                                 shuffle=True)

Check out the docs for more details. :slight_smile:
