How to convert string labels into ClassLabel classes for a custom dataset in pandas

I am trying to fine-tune the bert-base-uncased model, but after loading my dataset from a pandas DataFrame, trainer.train() raises the following error:
ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 5]))

I tried to understand the problem, and I think it is caused by the label column having the wrong data type. The following example illustrates it:
import pandas as pd
from datasets import Dataset

text = ["John", "snake", "car", "tree", "cloud", "clerk", "bike"]
labels = ["0", "1", "2", "3", "4", "0", "2"]
# create Pandas DataFrame
df = pd.DataFrame({"text": text, "label": labels})
# define data set object
ds = Dataset.from_pandas(df)
ds.features

The last command shows the following:
{'text': Value(dtype='string', id=None),
 'label': Value(dtype='string', id=None)}

According to the Hugging Face tutorial, it should be:
{'text': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=5, names=['0', '1', '2', '3', '4'], names_file=None, id=None)}

My question is how to convert the ‘label’ that has a string type into a ‘label’ that has the proper ClassLabel type. Tutorials say that one should use the map function, but I could not find any code examples.

Thank you for your help.

Hi @Krzysztof,

I think you can get what you want by using the features argument of Dataset.from_pandas:

import pandas as pd
from datasets import Dataset, Value, ClassLabel, Features

text = ["John", "snake", "car", "tree", "cloud", "clerk", "bike"]
labels = [0, 1, 2, 3, 4, 0, 2]
df = pd.DataFrame({"text": text, "label": labels})
# define data set object
features = Features({"text": Value("string"), "label": ClassLabel(num_classes=5, names=[0, 1, 2, 3, 4])})
ds = Dataset.from_pandas(df, features=features)
ds.features
# {'text': Value(dtype='string', id=None),
#  'label': ClassLabel(num_classes=5, names=[0, 1, 2, 3, 4], names_file=None, id=None)}

Thank you, it solves my problem


Hello,
This gives the following error for some reason:
from datasets import Dataset, Value, ClassLabel, Features


ValueError Traceback (most recent call last)
/tmp/ipykernel_23/306676310.py in
5 df = pd.DataFrame({"text": text, "label": labels})# define data set object
6 features = Features({"text": Value("string"), "label": ClassLabel(num_classes=5, names=[0,1,2,3,4])})
----> 7 ds = Dataset.from_pandas(df, features=features)
8 ds.features

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in from_pandas(cls, df, features, info, split, preserve_index)
782 preserve_index (bool, optional):
783 Whether to store the index as an additional column in the resulting Dataset.
--> 784 The default of None will store the index as a column, except for RangeIndex which is stored as metadata only.
785 Use preserve_index=True to force it to be stored as a column.
786

/opt/conda/lib/python3.7/site-packages/datasets/table.py in from_pandas(cls, *args, **kwargs)
709 """
710 Convert pandas.DataFrame to an Arrow Table.
--> 711
712 The column types in the resulting Arrow Table are inferred from the
713 dtypes of the pandas.Series in the DataFrame. In the case of non-object

/opt/conda/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

ValueError: too many values to unpack (expected 2)