I am building the training pipeline for a DistilBERT model and am trying to define the feature types for a Dataset that is loaded from a dictionary.
This dictionary holds the input_ids, labels and attention_mask fields that the tokenizer returns, and I can't seem to get the feature assignment right, especially the label feature. Here is an example:
data = {
'input_ids': [[101, 1, 2, 3, 102], [101, 11, 12, 13, 102]],
'labels': [[-100, 'O', 'B-NOUN', 'L-NOUN', -100], [-100, 'U-NOUN', 'O', 'O', -100]],
'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]
}
dt = Dataset.from_dict(data)
The problem comes because:
- The labels are Sequences of values, and I am not sure how to build a Sequence(ClassLabel) feature; hence, the code above crashes on .from_dict(data), since it expects int, not str.
- If I map the label names to ids before calling .from_dict(data) (as in the code below), the dataset is generated without problems, but it later causes strange behavior at training and prediction time: DistilBERT just builds its own id2label mapping, {0: 'LABEL_0', 1: 'LABEL_1', 2: 'LABEL_2'}, which I am not sure how to override (and this breaks the aggregation strategy at prediction time).
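For context, the pre-mapping step looks roughly like this. The label list is an assumption inferred from the tags in the first example, and the ids are 1-based to match the mapped data below; -100 is kept as-is since it is the loss ignore index:

```python
# Assumed label set, inferred from the tags in the example above
label_names = ["O", "B-NOUN", "L-NOUN", "U-NOUN"]
tag2id = {tag: i for i, tag in enumerate(label_names, start=1)}

def encode_tags(tags):
    # Keep -100 (the loss ignore index) untouched; map string tags to ids
    return [t if t == -100 else tag2id[t] for t in tags]

labels = [[-100, "O", "B-NOUN", "L-NOUN", -100], [-100, "U-NOUN", "O", "O", -100]]
encoded = [encode_tags(seq) for seq in labels]
print(encoded)  # [[-100, 1, 2, 3, -100], [-100, 4, 1, 1, -100]]
```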
Example of the mapped data described in the second point:
data = {
'input_ids': [[101, 1, 2, 3, 102], [101, 11, 12, 13, 102]],
'labels': [[-100, 1, 2, 3, -100], [-100, 4, 1, 1, -100]],
'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]
}
dt = Dataset.from_dict(data)
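One thing that seems to address the LABEL_0 behavior (though I have not fully verified it) is passing id2label/label2id to the model config instead of letting transformers generate its defaults. A minimal sketch, where the mapping is hypothetical, follows the 1-based ids from the mapped example (with a placeholder at id 0 so the ids stay contiguous):

```python
from transformers import DistilBertConfig

# Hypothetical mapping; ids 1-4 follow the mapped example above, with a
# placeholder at 0 so num_labels covers every id
id2label = {0: "PAD", 1: "O", 2: "B-NOUN", 3: "L-NOUN", 4: "U-NOUN"}
label2id = {label: i for i, label in id2label.items()}

config = DistilBertConfig(id2label=id2label, label2id=label2id)
print(config.id2label[2])  # B-NOUN
```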
Any idea how to correctly generate the dataset and avoid that problem with DistilBERT?
UPDATE: I decided to try another route, which is basically building the dataset first and then tokenizing. The problem I have now is that align_labels_with_mapping crashes when the feature is a Sequence, and casting the feature also fails to cast the values inside the Sequence object.
What am I doing wrong?