Use tf.data.Dataset with HuggingFace datasets

I was going through the tutorial "Using a Dataset with PyTorch/Tensorflow" (datasets 1.5.0 documentation).
The example is for PyTorch.
Is there an equivalent for TensorFlow?

There is a TensorFlow section as well. In the top right corner of the page there is a toggle between TensorFlow and PyTorch; PyTorch is the default.

This was taken from the official documentation; it is the TensorFlow version:

>>> import tensorflow as tf
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> dataset = load_dataset('glue', 'mrpc', split='train')
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
>>> dataset = dataset.map(lambda e: tokenizer(e['sentence1'], truncation=True, padding='max_length'), batched=True)
>>>
>>> dataset.set_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
>>> features = {x: dataset[x].to_tensor(default_value=0, shape=[None, tokenizer.model_max_length]) for x in ['input_ids', 'token_type_ids', 'attention_mask']}
>>> tfdataset = tf.data.Dataset.from_tensor_slices((features, dataset["label"])).batch(32)
>>> next(iter(tfdataset))
({'input_ids': <tf.Tensor: shape=(32, 512), dtype=int32, numpy=
array([[  101,  7277,  2180, ...,   
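In case it helps to see what the padding and batching steps above amount to, here is a rough plain-Python sketch. It is not how datasets or TensorFlow implement it; `pad_to_length` and `batch` are hypothetical helpers written only for illustration, and the token ids are made up. As I understand it, `to_tensor(default_value=0, shape=[None, N])` pads or truncates each row to N items, and `.batch(32)` groups consecutive rows while keeping a smaller final batch by default.

```python
# Conceptual sketch only: plain Python, no TensorFlow required.
# pad_to_length and batch are hypothetical helpers, not library functions.

def pad_to_length(rows, length, pad_value=0):
    """Right-pad (or truncate) each row so every row has exactly `length` items."""
    padded = []
    for row in rows:
        kept = list(row[:length])  # truncate rows that are too long
        padded.append(kept + [pad_value] * (length - len(kept)))
    return padded


def batch(items, batch_size):
    """Yield consecutive chunks of `batch_size`; the last chunk may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy token ids standing in for tokenizer output (made-up values).
input_ids = [[101, 7, 102], [101, 8, 9, 10, 102], [101, 102]]

padded = pad_to_length(input_ids, length=6)
batches = list(batch(padded, batch_size=2))

for i, b in enumerate(batches):
    print(i, b)
```

In the documentation example the tokenizer already pads to `max_length`, so the `to_tensor` call mostly just converts to a dense tensor of a fixed width.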

Thanks a lot :)
