Huggingface datasets convert a dataset to pandas and then convert it back

nitempe · February 14, 2022, 3:19pm

I am following this page. I loaded a dataset, converted it to a pandas DataFrame, and then converted it back to a dataset. The features of the new dataset did not match those of the original, so the two datasets didn't match. How can I set the features of the new dataset so that they match the old dataset?

    import pandas as pd
    import datasets
    from transformers import LongformerTokenizerFast, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig
    import torch.nn as nn
    import torch
    from torch.utils.data import Dataset, DataLoader
    import numpy as np
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    from tqdm import tqdm
    #import wandb
    import os
    
    train_data_s1, test_data_s1 = datasets.load_dataset('imdb', split =['train[0:500]', 'test[0:500]'], 
                                                 cache_dir='/media/data_files/github/website_tutorials/data')
    
    print (type (train_data_s1))
      #<class 'datasets.arrow_dataset.Dataset'> 

 
    
    #converting to pandas - https://towardsdatascience.com/use-the-datasets-library-of-hugging-face-in-your-next-nlp-project-94e300cca850
    print (type(train_data_s1))
    df_pandas = pd.DataFrame(train_data_s1)
    print (type(df_pandas))


    #<class 'datasets.arrow_dataset.Dataset'>
    #<class 'pandas.core.frame.DataFrame'>
    
    from datasets import Dataset
    import pandas as pd
    
    dataset_from_pandas = Dataset.from_pandas(df_pandas)


    dataset_from_pandas == train_data_s1
    #False
    
    #these match
    print (train_data_s1[0])
    print (dataset_from_pandas[0])
    
     {'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.', 'label': 0}

 


    
    
    #these dont match
    print (train_data_s1.features)
    print (dataset_from_pandas.features)
    
    {'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)}
    {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}

Update 1

I modified my code as shown below so that the features match, but the two datasets still do not compare equal.

# Based on https://discuss.huggingface.co/t/how-to-create-custom-classlabels/13650 ("basic_sentiment holds values [-1,0,1]")
from datasets import ClassLabel
dataset_from_pandas = dataset_from_pandas.cast_column("label", ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None))

#these values match

print (train_data_s1[0])
print (dataset_from_pandas[0])
#{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.', 'label': 0}

#{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it\'s not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn\'t have much of a plot.', 'label': 0}

#features match too
print (train_data_s1.features)
print (dataset_from_pandas.features)

#{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)}
#{'text': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['neg', 'pos'], names_file=None, id=None)}

#But both datasets still don’t match
dataset_from_pandas == train_data_s1

#False

mariosasko · February 14, 2022, 5:04pm

Hi! Our Dataset class doesn’t define a custom __eq__ at the moment, so dataset_from_pandas == train_data_s1 is False unless these objects point to the same memory address (default __eq__ behavior).
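The default behavior is easy to reproduce without the library. A minimal sketch in plain Python, where `Plain` and `WithEq` are made-up classes used only for illustration (they are not part of `datasets`):

```python
class Plain:
    """No __eq__ defined, so == falls back to identity comparison."""

    def __init__(self, rows):
        self.rows = rows


class WithEq(Plain):
    """Same data, but == compares the stored rows by value."""

    def __eq__(self, other):
        if not isinstance(other, WithEq):
            return NotImplemented
        return self.rows == other.rows


a, b = Plain([1, 2]), Plain([1, 2])
plain_equal = a == b   # two distinct objects with identical content
plain_same = a == a    # the very same object

c, d = WithEq([1, 2]), WithEq([1, 2])
value_equal = c == d   # compared by content through the custom __eq__

print(plain_equal, plain_same, value_equal)
```

Two `Plain` objects with identical content compare unequal, which is the same situation as the two datasets in the question.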

I’ll open a PR to fix this. In the meantime, you can test if the datasets are equal as follows:

def are_datasets_equal(dset1, dset2):
    return dset1.data == dset2.data and dset1.features == dset2.features

loretoparisi · April 29, 2022, 5:16pm

Hey! I have my dataset loaded as

sentences = load_dataset(
     "loretoparisi/tatoeba-sentences",
     data_files=data_files,
     delimiter='\t', 
     column_names=['label', 'text'],
     download_mode="force_redownload")

I then convert to Pandas to remove NaN:

import pandas as pd
df_test = pd.DataFrame( sentences['test'] )
df_test = df_test.dropna()
df_train = pd.DataFrame( sentences['train'] )
df_train = df_train.dropna()

Then I convert back to a dataset:

from datasets import Dataset
train = Dataset.from_pandas(df_train)
test = Dataset.from_pandas(df_test)

but now I need to tokenize it! Originally I did

from transformers import AutoTokenizer
model_name = 'microsoft/xtremedistil-l6-h256-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    tokens = tokenizer(batch['text'], padding="max_length", truncation=True, max_length=128)
    tokens['label'] = features["label"].str2int(batch['label']) if batch["label"] is not None else None
    return tokens
tokenized_datasets = sentences.map(tokenize, batched=True)

So how do I convert train and test back into the sentences dataset so that I can tokenize both?

mariosasko · May 2, 2022, 1:08pm

Hi @loretoparisi! sentences is an object of type datasets.DatasetDict. You can recreate it as follows:

import datasets
sentences = datasets.DatasetDict(
    {
        "train": train,
        "test": test,
    }
)

loretoparisi · May 6, 2022, 3:35pm

Hello, doing so I get an error:

TypeError: Values in `DatasetDict` should of type `Dataset` but got type '<class 'pandas.core.frame.DataFrame'>'

I had assumed that sentences['train'] and sentences['test'] would be automatically converted to Dataset instances.

loretoparisi · May 6, 2022, 3:46pm

To fix that, I did it this way:

import datasets
from datasets import Dataset
import pandas as pd
df_test = pd.read_hdf('df_test.hdf')
df_train = pd.read_hdf('df_train.hdf')
sentences = datasets.DatasetDict(
    {
        "train": Dataset.from_pandas(df_train),
        "test": Dataset.from_pandas(df_test),
    }
)
