Hey @rajkumar I believe the to_tf_dataset() method was only added in a recent version of datasets. Could you try upgrading to the latest version and check if the problem persists?
Hello, I am getting the same error, even though I am using newer versions of transformers and datasets:
UnexpectedStatusException: Error for Training job sample-huggingface-training-2022-02-25-22-41-34: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage "AttributeError: 'Dataset' object has no attribute 'to_tf_dataset'
"
These are the versions I am using:
sagemaker: 2.77.0
transformers: 4.11.0
tensorflow: 2.7.1
datasets: 1.18.3
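One thing that may be worth checking here (an assumption, not a confirmed diagnosis): the training job runs in its own container, so the package versions the script sees at runtime can differ from the ones installed locally. A minimal sketch for logging them from inside the training script, using only the standard library; the helper name report_versions is hypothetical and the package list is just the one from this post.

```python
from importlib import metadata


def report_versions(packages=("datasets", "transformers", "tensorflow", "sagemaker")):
    """Return a dict mapping each package name to its installed version (None if missing)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per package so the versions show up in the training job logs.
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

If the datasets version printed in the job logs is older than the local one, that would explain why to_tf_dataset() is missing there even though it exists locally.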
Wow! It looks like many people are having similar problems, because in general TensorFlow seems to have difficulty converting Pandas DataFrames to the tf.data.Dataset format using to_tf_dataset(). I therefore followed this solution from Stack Overflow, and it worked! I just had to repeat the same steps that I had been doing with the Pandas DataFrames, including tokenization. It would be better if this were solved in TensorFlow, rather than having to use:
from datasets import Dataset
tf_dataset = Dataset.from_pandas(dataframe)
I also faced a similar problem, and although @soundaryasundari's solution helped a little, it is not complete. So I thought I should share mine.
First, we need to convert the DatasetDict object to a Pandas DataFrame. dataset.set_format() and dataset.with_format() should work for that, but unfortunately they don't. We need to convert it manually as follows:
import pandas as pd
from datasets import Dataset

# Convert the "train" split of the dataset to a DataFrame
pd_data = pd.DataFrame(dataset["train"])
# Now, convert it back to a Dataset object (a single split, not a DatasetDict)
ds_data = Dataset.from_pandas(pd_data)
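Since the original error in this thread was a missing to_tf_dataset attribute, it may help to check for the method on the converted object before calling it, so that the failure message points at the likely cause. A minimal sketch; the helper name require_to_tf_dataset is hypothetical, and the hint about the datasets version is an assumption based on the first reply in this thread.

```python
def require_to_tf_dataset(ds):
    """Return ds unchanged if it exposes a callable to_tf_dataset(); otherwise raise."""
    if not callable(getattr(ds, "to_tf_dataset", None)):
        raise RuntimeError(
            f"{type(ds).__name__} has no to_tf_dataset() method; "
            "the installed datasets version is probably too old."
        )
    return ds
```

With the code above, this would be called as require_to_tf_dataset(ds_data) right before ds_data.to_tf_dataset() is used.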