Dataset object has no attribute `to_tf_dataset`

rajkumar · November 20, 2021, 11:06am

I am following the Hugging Face Course and am currently at the chapter on fine-tuning a model.
Link: Fine-tuning a pretrained model - Hugging Face Course

I use tokenize_function and map as mentioned in the course to process data.

# define a tokenize function
def Tokenize_function(example):
    return tokenizer(example['sentence'], truncation=True)

# tokenize entire data
tokenized_data = raw_data.map(Tokenize_function, batched=True)

I get a Dataset object at this point. When I try to convert it to a TF dataset object as described in the course, it throws the following error.

# convert to TF dataset
train_data = tokenized_data["train"].to_tf_dataset(
    columns=['attention_mask', 'input_ids', 'token_type_ids'],
    label_cols=['label'],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8
)

Output:

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_42/103099799.py in <module>
1 # convert to TF dataset
----> 2 train_data = tokenized_data["train"].to_tf_dataset( \
3 columns = ['attention_mask', 'input_ids', 'token_type_ids'], \
4 label_cols = ['label'], \
5 shuffle = True, \
AttributeError: 'Dataset' object has no attribute 'to_tf_dataset'

When I inspect dir(tokenized_data["train"]), there is no method or attribute named to_tf_dataset.

Why do I get this error, and how can I fix it?

Please help me.

lewtun · November 20, 2021, 9:01pm

Hey @rajkumar I believe the to_tf_dataset() method was only added in a recent version of datasets. Could you try upgrading to the latest version and check if the problem persists?
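One way to check this before upgrading is to print the installed versions and test whether the method exists on the Dataset class. This is only a minimal sketch: `installed_version` and `has_to_tf_dataset` are hypothetical helper names, not part of datasets, and the check uses nothing beyond the standard library plus an optional import of datasets.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_to_tf_dataset():
    """Return True if the importable datasets.Dataset class defines to_tf_dataset."""
    try:
        from datasets import Dataset
    except ImportError:
        # datasets is not installed in this environment
        return False
    return hasattr(Dataset, "to_tf_dataset")


if __name__ == "__main__":
    for name in ("datasets", "transformers", "tensorflow"):
        print(name, installed_version(name))
    print("Dataset.to_tf_dataset available:", has_to_tf_dataset())
```

Running this in the same environment that raises the error matters, since a notebook kernel or a remote training job may not use the packages that pip just upgraded elsewhere.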

rajkumar · November 21, 2021, 6:53am

Hi @lewtun. You are absolutely correct. I upgraded transformers and datasets to the latest versions and the issues are resolved.

# upgrade transformers and datasets to latest versions
!pip install --upgrade transformers
!pip install --upgrade datasets

Thanks a lot for your timely reply.

BabaYao · February 25, 2022, 11:45pm

Hello, I am getting the same error, even though I am using newer versions of transformers and datasets:
UnexpectedStatusException: Error for Training job sample-huggingface-training-2022-02-25-22-41-34: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "AttributeError: 'Dataset' object has no attribute 'to_tf_dataset'"
These are the versions I am using:
sagemaker: 2.77.0
transformers: 4.11.0
tensorflow: 2.7.1
datasets: 1.18.3

How do I resolve this? Thanks

Ayomidejoe · September 30, 2022, 5:26am

I am having this problem too ('DatasetDict' object has no attribute 'to_tf_dataset').
I have tried the following without success:

# upgrade transformers and datasets to latest versions

!pip install --upgrade transformers
!pip install --upgrade datasets

I would appreciate any help with solving this.
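The message here names 'DatasetDict' rather than 'Dataset', which hints at a different cause from the first post: a DatasetDict is a dictionary of splits, and to_tf_dataset() is a method of the individual Dataset objects inside it (the first post calls it on tokenized_data["train"]). Below is a minimal sketch of selecting a split first, assuming the container behaves like a plain mapping keyed by split name. `select_split` is a hypothetical helper, not a datasets function.

```python
from collections.abc import Mapping


def select_split(data, split="train"):
    """Return one split from a dict-like container of splits.

    If `data` is not a mapping (for example, it is already a single
    dataset), it is returned unchanged.
    """
    if isinstance(data, Mapping):
        if split not in data:
            raise KeyError(f"split {split!r} not found; available: {sorted(data)}")
        return data[split]
    return data
```

With the names from the first post, this would be used as select_split(tokenized_data, "train").to_tf_dataset(...), which amounts to indexing the split by hand before converting.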

soundaryasundari · December 6, 2022, 8:07pm

I found a solution!

Wow! It looks like many people are having similar problems, because data held in a Pandas DataFrame cannot be converted to tf.data.Dataset format with to_tf_dataset() directly. I followed this solution from Stack Overflow and it worked! I just had to repeat the same steps that I had been doing with the Pandas DataFrame and tokenize again. It would be nice if this were handled automatically instead of having to use:

from datasets import Dataset
# dataframe is an existing Pandas DataFrame; the result is a datasets.Dataset
hf_dataset = Dataset.from_pandas(dataframe)

Hope this helps~ Happy Automating!

Link: nlp - Attribute error: DatasetDict' object has no attribute 'to_tf_dataset' - Stack Overflow

MUmairAB · July 8, 2023, 1:41pm

I also faced a similar problem, and although @soundaryasundari's solution helped a little, it is not complete, so I thought I should share mine.

First, we need to convert the DatasetDict object to a Pandas DataFrame. dataset.set_format() and dataset.with_format() should work for that, but unfortunately they don't. We need to convert it manually as follows:

import pandas as pd
from datasets import Dataset

# Convert the "train" split of the dataset to a DataFrame
pd_data = pd.DataFrame(dataset["train"])

# Now, convert it back to a Dataset object
ds_data = Dataset.from_pandas(pd_data)

Now it can be converted to a tf.data.Dataset object.
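For that last step, the call from the first post can be reused on the rebuilt Dataset. This is a hedged sketch: `convert_to_tf` is a hypothetical wrapper, the column names are the ones from the original question (they exist only after tokenization and will differ for other tokenizers or datasets), and `data_collator` is assumed to be the collator already defined in the course code.

```python
def convert_to_tf(ds_data, data_collator, batch_size=8):
    """Call to_tf_dataset on a datasets.Dataset with the arguments used in the first post."""
    return ds_data.to_tf_dataset(
        # model inputs produced by the tokenizer (assumed column names)
        columns=["attention_mask", "input_ids", "token_type_ids"],
        # target column (assumed name)
        label_cols=["label"],
        shuffle=True,
        collate_fn=data_collator,
        batch_size=batch_size,
    )
```

It would be called as train_data = convert_to_tf(ds_data, data_collator), where ds_data is the Dataset rebuilt from the DataFrame.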
