Dataset object has no attribute `to_tf_dataset`

I am following the Hugging Face Course and am currently on the chapter about fine-tuning a model.
Link: Fine-tuning a pretrained model - Hugging Face Course

I use a tokenize function and map, as described in the course, to process the data.

# define a tokenize function
def Tokenize_function(example):
    return tokenizer(example['sentence'], truncation=True)

# tokenize entire data
tokenized_data = raw_data.map(Tokenize_function, batched=True)

I get a Dataset object at this point. When I try to convert it to a TF dataset as described in the course, I get the following error.

# convert to TF dataset
train_data = tokenized_data["train"].to_tf_dataset(
    columns=['attention_mask', 'input_ids', 'token_type_ids'],
    label_cols=['label'],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

Output:

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_42/103099799.py in <module>
1 # convert to TF dataset
----> 2 train_data = tokenized_data["train"].to_tf_dataset( \
3 columns = ['attention_mask', 'input_ids', 'token_type_ids'], \
4 label_cols = ['label'], \
5 shuffle = True, \
AttributeError: 'Dataset' object has no attribute 'to_tf_dataset'

When I inspect dir(tokenized_data["train"]), there is no method or attribute named to_tf_dataset.

Why do I get this error, and how can I fix it?

Please help me.

Hey @rajkumar, I believe the to_tf_dataset() method was only added in a recent version of datasets. Could you try upgrading to the latest version and check whether the problem persists?
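
To confirm which versions are actually installed in the environment that runs the code (a notebook kernel, or a remote training container), a small check like the one below may help. It is only a sketch: installed_version is a helper name made up for this example, and it relies on the standard-library importlib.metadata rather than on anything specific to datasets.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent.

    `installed_version` is a hypothetical helper for this example only.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions visible to the interpreter that runs this code.
for name in ("datasets", "transformers", "tensorflow"):
    print(f"{name}: {installed_version(name)}")
```

In a notebook, the kernel usually has to be restarted after an upgrade before the new version is picked up.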

Hi @lewtun. You are absolutely correct. I upgraded transformers and datasets to the latest versions and the issues are resolved.

# upgrade transformers and datasets to latest versions
!pip install --upgrade transformers
!pip install --upgrade datasets

Thanks a lot for your timely reply.

Hello, I am getting the same error even though I am using newer versions of transformers and datasets:
UnexpectedStatusException: Error for Training job sample-huggingface-training-2022-02-25-22-41-34: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "AttributeError: 'Dataset' object has no attribute 'to_tf_dataset'
"
These are the versions I am using:
sagemaker: 2.77.0
transformers: 4.11.0
tensorflow: 2.7.1
datasets: 1.18.3

How do I resolve this? Thanks

I am having this problem too ('DatasetDict' object has no attribute 'to_tf_dataset').
I have tried the following without success:

# upgrade transformers and datasets to latest versions
!pip install --upgrade transformers
!pip install --upgrade datasets

I would appreciate any help with solving this.

I found a solution! :blush:

Wow! It looks like many people are having similar problems, apparently because converting data that comes from Pandas dataframes to the tf.data.Dataset format with to_tf_dataset() does not work smoothly in general. I followed this solution from Stack Overflow and it worked! I just had to repeat the same steps I had been doing with the Pandas dataframes and tokenize again. It would be good to have this solved upstream instead of having to use:

from datasets import Dataset
tf_dataset = Dataset.from_pandas(dataframe)

Hope this helps~ Happy Automating! :hugs:

Link: nlp - Attribute error: DatasetDict' object has no attribute 'to_tf_dataset' - Stack Overflow

I also faced a similar problem, and although @soundaryasundari’s solution helped a little, it is not complete, so I thought I should share mine.

First, we need to convert the DatasetDict object to a Pandas DataFrame. dataset.set_format() and dataset.with_format() should work for that, but unfortunately they don’t. We need to convert it manually, as follows:

# Convert one split of the DatasetDict to a DataFrame
import pandas as pd
pd_data = pd.DataFrame(dataset["train"])

# Now convert it back; note that from_pandas returns a Dataset, not a DatasetDict
from datasets import Dataset
ds_data = Dataset.from_pandas(pd_data)

Now it can be converted to a tf.data.Dataset object.
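
A possible reason this round trip helps, offered as a sketch rather than a confirmed diagnosis: to_tf_dataset() is a method of a single Dataset split, while a DatasetDict is a dict of splits and has no such method. The step that matters may simply be selecting one split with dataset["train"]. The helper below, select_split, is a made-up name for illustration and is not part of the datasets library; it uses a plain dict as stand-in data so it runs without datasets installed.

```python
def select_split(data, split="train"):
    """Return a single split.

    `select_split` is a hypothetical helper, not part of the datasets library.
    A DatasetDict behaves like a dict of splits, so it is indexed by name;
    anything else is assumed to already be a single split.
    """
    if isinstance(data, dict):
        return data[split]
    return data


# Stand-in data (assumption): a plain dict playing the role of a DatasetDict.
fake_dataset_dict = {"train": ["row 0", "row 1"], "validation": ["row 2"]}
train_split = select_split(fake_dataset_dict, "train")
print(train_split)
```

With the real objects, the selected split could then be passed on to to_tf_dataset(...), provided the installed datasets version includes that method.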