Problems with Dataset.from_dict() and Feature types

I am building the training pipeline for a DistilBERT model and am trying to define the Feature types for a Dataset that is loaded from a dictionary.

This dictionary contains the input_ids, labels and attention_mask fields that the tokenizer returns, and I can’t seem to get the feature types assigned correctly, especially for the labels feature. Here is an example:

data = {
    'input_ids': [[101, 1, 2, 3, 102], [101, 11, 12, 13, 102]],
    'labels': [[-100, 'O', 'B-NOUN', 'L-NOUN', -100], [-100, 'U-NOUN', 'O', 'O', -100]],
    'attention_mask': [[0, 1, 1], [1, 0, 0]]
}

dt = Dataset.from_dict(data)

There are two problems:

  1. The labels are Sequences of values, and I am not sure how to build a Sequence[ClassLabel] feature. As a result, the code above crashes on .from_dict(data), since it expects int, not str.
  2. If I map the label names to integers before calling .from_dict(data) (as in the code after this list), the dataset is generated without problems, but the model then behaves strangely at training and prediction time. DistilBERT builds its own mapping, such as {0: 'LABEL_0', 1: 'LABEL_1', 2: 'LABEL_2'}, which I am not sure how to fix (and which causes problems with the aggregation strategy at prediction time).

Example of the mapped data described in point 2:

data = {
    'input_ids': [[101, 1, 2, 3, 102], [101, 11, 12, 13, 102]],
    'labels': [[-100, 1, 2, 3, -100], [-100, 4, 1, 1, -100]],
    'attention_mask': [[0, 1, 1], [1, 0, 0]]
}

dt = Dataset.from_dict(data)

Any idea how to correctly generate the dataset and avoid that problem with DistilBERT?

UPDATE: I decided to try another route: building the dataset first and then tokenizing. The problem now is that the function align_labels_with_mapping crashes when the feature is a Sequence, and casting the feature also failed to cast the values inside the Sequence object.

What am I doing wrong?

Hi! You can define a Sequence of ClassLabel with

Sequence(ClassLabel(names=...))

but in this case you need your data to be integers, not strings.
You can convert your data to integers with the following call, where features is the Features object that describes your columns:

data = features.encode_batch(data)

and then get your dataset:

dt = Dataset.from_dict(data, features=features)
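
To make the integer conversion concrete, here is a rough plain-Python sketch that does not depend on the datasets library. The label list, its order, and the helper name encode_labels are assumptions made for illustration, not something taken from the question. The -100 entries are the usual "ignore this position" marker rather than class names, and I am not certain how ClassLabel handles them, so the sketch passes them through without looking them up.

```python
# Hypothetical label list; the position of each name defines its integer id.
label_names = ["O", "B-NOUN", "L-NOUN", "U-NOUN"]
label2id = {name: i for i, name in enumerate(label_names)}

IGNORE_INDEX = -100  # marker for special-token positions, left unchanged


def encode_labels(batch_labels):
    """Map string tags to integer ids, passing IGNORE_INDEX through as-is."""
    return [
        [tag if tag == IGNORE_INDEX else label2id[tag] for tag in seq]
        for seq in batch_labels
    ]


# Labels copied from the example in the question.
labels = [
    [-100, "O", "B-NOUN", "L-NOUN", -100],
    [-100, "U-NOUN", "O", "O", -100],
]
encoded = encode_labels(labels)
print(encoded)
```

The encoded lists could then replace data['labels'] before building the dataset. Note that the ids start at 0 here, unlike the 1-based mapping in the question.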

Regarding the label2id of DistilBERT, you can define your own by specifying the id2label parameter of the model config when you load your model with .from_pretrained(..., id2label=...).
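
As a minimal sketch of what those mappings could look like, assuming a hypothetical label list whose order matches the integer ids used in your dataset:

```python
# Hypothetical label list; it must use the same order as the integer
# ids stored in the dataset's labels column.
label_names = ["O", "B-NOUN", "L-NOUN", "U-NOUN"]

id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in id2label.items()}

# Keyword arguments intended to be forwarded to .from_pretrained(...).
model_kwargs = {
    "num_labels": len(label_names),
    "id2label": id2label,
    "label2id": label2id,
}

# Assumed usage (the checkpoint name is an assumption, not from this thread):
# from transformers import AutoModelForTokenClassification
# model = AutoModelForTokenClassification.from_pretrained(
#     "distilbert-base-uncased", **model_kwargs
# )
print(model_kwargs)
```

With the mappings set on the config, predictions should carry your own tag names instead of the generic LABEL_0, LABEL_1, ... names.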

Could you open an issue on GitHub so that we can look into what went wrong?
If you have a code snippet that reproduces the error, that would be of great help as well.