Merging a custom dataset with a dataset on Hugging Face: problem with features

genton · April 20, 2024, 12:03pm

Hi,

I’m pretty new to Hugging Face and am having trouble merging two datasets.

I’m trying to add some samples to this dataset, specifically more training examples:

The features of this dataset are:
{'id': Value(dtype='string', id=None),
'annotators': Sequence(feature={'label': ClassLabel(names=['hatespeech', 'normal', 'offensive'], id=None), 'annotator_id': Value(dtype='int32', id=None), 'target': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}, length=-1, id=None),
'rationales': Sequence(feature=Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None), length=-1, id=None),
'post_tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

However, the data I am trying to add is not on Hugging Face; it comes from another paper on hate speech detection. I tried to reproduce the same features so that I could concatenate the two datasets, but the closest I can get is the following:
{'post_id': Value(dtype='string', id=None),
'annotators': [{'annotator_id': Value(dtype='null', id=None),
'label': Value(dtype='string', id=None),
'target': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}],
'rationales': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
'post_tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

As you can see, the feature type of 'annotators' does not match, so I cannot concatenate them.
I tried using cast_column to change the feature type of each key individually, which gives:
{'post_id': Value(dtype='string', id=None),
'annotators_label': ClassLabel(names=['hatespeech', 'normal', 'offensive'], id=None),
'annotators_id': Value(dtype='int32', id=None),
'annotators_target': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
'rationales': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
'post_tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}

I also tried putting them together in a dictionary, wrapping that dictionary in a list (as in the description of the data on GitHub), and specifying the desired feature type when adding the result as a new column to the dataset, but that doesn’t work either.

I have tried basically everything I know about Hugging Face (which is not a lot, I concede), but I just can’t manage to obtain the right feature type for the 'annotators' key:
'annotators': Sequence(feature={'label': ClassLabel(names=['hatespeech', 'normal', 'offensive'], id=None), 'annotator_id': Value(dtype='int32', id=None), 'target': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}, length=-1, id=None)

Does anyone have an idea how I can do this?
Thanks for your help!
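
For illustration, the mismatch described above can be narrowed down to the shape of each row. In the datasets library, a Sequence whose feature is a dictionary is stored as a dictionary of lists rather than a list of dictionaries, and a ClassLabel is stored as an integer index into its names. The following is a minimal sketch in plain Python of reshaping one custom row accordingly. The helper names, the sample row, and the use of the list position as a stand-in for the missing annotator_id are all assumptions made for the example, not part of either dataset.

```python
# Label order copied from the ClassLabel in the target features.
LABEL_NAMES = ["hatespeech", "normal", "offensive"]
LABEL_TO_ID = {name: index for index, name in enumerate(LABEL_NAMES)}


def annotators_to_columns(annotators):
    """Turn a list of per-annotator dicts into one dict of parallel lists."""
    return {
        # ClassLabel values are stored as integer indices into LABEL_NAMES.
        "label": [LABEL_TO_ID[a["label"]] for a in annotators],
        # Assumption: when annotator_id is missing (null), fall back to the
        # annotator's position in the list so the column can be int32.
        "annotator_id": [
            a["annotator_id"] if a.get("annotator_id") is not None else position
            for position, a in enumerate(annotators)
        ],
        "target": [list(a["target"]) for a in annotators],
    }


def convert_row(row):
    """Rename 'post_id' to 'id' and reshape 'annotators' (hypothetical helper)."""
    return {
        "id": row["post_id"],
        "annotators": annotators_to_columns(row["annotators"]),
        "rationales": row["rationales"],
        "post_tokens": row["post_tokens"],
    }


# Hypothetical row in the shape of the custom data.
sample_row = {
    "post_id": "example_1",
    "annotators": [
        {"annotator_id": None, "label": "offensive", "target": ["Women"]},
        {"annotator_id": None, "label": "normal", "target": []},
    ],
    "rationales": [[0, 1, 0]],
    "post_tokens": ["some", "example", "tokens"],
}

converted_row = convert_row(sample_row)
print(converted_row["annotators"])
```

Rows in this shape could then presumably be passed to Dataset.from_list together with an explicit features argument that spells out the target schema (including int32 for 'rationales'), after which concatenate_datasets may accept both datasets. That last step is not verified here and may depend on the installed version of datasets.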
