Hi,
I’m pretty new to Hugging Face and am having some trouble merging two datasets.
I’m trying to add some samples to this dataset, mainly more training examples.
The features of this dataset are:
{‘id’: Value(dtype=‘string’, id=None),
‘annotators’: Sequence(feature={‘label’: ClassLabel(names=[‘hatespeech’, ‘normal’, ‘offensive’], id=None), ‘annotator_id’: Value(dtype=‘int32’, id=None), ‘target’: Sequence(feature=Value(dtype=‘string’, id=None), length=-1, id=None)}, length=-1, id=None),
‘rationales’: Sequence(feature=Sequence(feature=Value(dtype=‘int32’, id=None), length=-1, id=None), length=-1, id=None),
‘post_tokens’: Sequence(feature=Value(dtype=‘string’, id=None), length=-1, id=None)}
However, the data I am trying to add is not on Hugging Face; it comes from another paper on hate speech detection. I tried to reproduce the same features so that I could concatenate the two datasets, but I am only able to obtain the following features:
{‘post_id’: Value(dtype=‘string’, id=None),
‘annotators’: [{‘annotator_id’: Value(dtype=‘null’, id=None),
‘label’: Value(dtype=‘string’, id=None),
‘target’: Sequence(feature=Value(dtype=‘string’, id=None), length=-1, id=None)}],
‘rationales’: Sequence(feature=Sequence(feature=Value(dtype=‘int64’, id=None), length=-1, id=None), length=-1, id=None),
‘post_tokens’: Sequence(feature=Value(dtype=‘string’, id=None), length=-1, id=None)}
As you can see, the feature type of ‘annotators’ does not match, so I cannot concatenate them.
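For reference, here is how I understand the mismatch, as a minimal sketch in plain Python (it does not import `datasets`). My assumptions: a `Sequence` over a dict of features expects one dict of parallel lists per post, whereas my data has a list of dicts per post; labels should be the integer index into the `ClassLabel` names; and `annotator_id` should be an integer rather than always null. The helper name `to_target_layout`, the sample row and the placeholder IDs are made up for illustration.

```python
# Order copied from the ClassLabel names of the target dataset
LABEL_NAMES = ["hatespeech", "normal", "offensive"]


def to_target_layout(row):
    """Reshape one row from a list of annotator dicts into a dict of parallel lists."""
    annotators = row["annotators"]
    return {
        # The target dataset calls this column 'id', not 'post_id'
        "id": row["post_id"],
        "annotators": {
            # String label -> integer index, as a ClassLabel would store it
            "label": [LABEL_NAMES.index(a["label"]) for a in annotators],
            # Assumption: a missing annotator ID is replaced by its position in the list
            "annotator_id": [
                a["annotator_id"] if a["annotator_id"] is not None else i
                for i, a in enumerate(annotators)
            ],
            "target": [list(a["target"]) for a in annotators],
        },
        "rationales": [list(r) for r in row["rationales"]],
        "post_tokens": list(row["post_tokens"]),
    }


# Made-up row in the layout my data currently has
sample = {
    "post_id": "example_1",
    "annotators": [
        {"annotator_id": None, "label": "offensive", "target": ["Other"]},
        {"annotator_id": None, "label": "normal", "target": ["None"]},
    ],
    "rationales": [[0, 1, 0], [0, 0, 0]],
    "post_tokens": ["made", "up", "tokens"],
}

converted = to_target_layout(sample)
print(converted["annotators"])
```

If this understanding is right, rows in this layout are presumably what would need to be passed to something like `Dataset.from_dict(..., features=...)` together with the target features, but that is the part I have not managed to get working.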
I tried using cast_column to change the feature type of each key individually:
{‘post_id’: Value(dtype=‘string’, id=None),
‘annotators_label’: ClassLabel(names=[‘hatespeech’, ‘normal’, ‘offensive’], id=None),
‘annotators_id’: Value(dtype=‘int32’, id=None),
‘annotators_target’: Sequence(feature=Value(dtype=‘string’, id=None), length=-1, id=None),
‘rationales’: Sequence(feature=Sequence(feature=Value(dtype=‘int64’, id=None), length=-1, id=None), length=-1, id=None),
‘post_tokens’: Sequence(feature=Value(dtype=‘string’, id=None), length=-1, id=None)}
I also tried putting them together in a dictionary, wrapping that in a list (as in the description of the data on GitHub), and then specifying the desired feature type when adding the new column to the dataset, but that doesn’t work either.
I have basically tried everything I know about Hugging Face (which is not a lot, I concede), but I just can’t manage to get the right feature type for the ‘annotators’ key:
‘annotators’: Sequence(feature={‘label’: ClassLabel(names=[‘hatespeech’, ‘normal’, ‘offensive’], id=None), ‘annotator_id’: Value(dtype=‘int32’, id=None), ‘target’: Sequence(feature=Value(dtype=‘string’, id=None), length=-1, id=None)}, length=-1, id=None)
Does anyone have an idea how I can do this?
Thanks for your help!