Tutorial: Fine-tuning with custom datasets – sentiment, NER, and question answering

@joeddav Thank you for the tutorial. I was trying to replicate the fine-tuning code with a different dataset and it worked. But when I changed the pretrained model from DistilBERT to something else like RoBERTa or XLNet, I got an error in the encoding function.

This is the encoding function:

import numpy as np

def encode_tags(tags, encodings):
    # tag2id maps tag strings to integer ids (built earlier from the unique tags)
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # create an array of -100, the label index ignored by the loss
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # set labels where the first offset is 0 and the second is not,
        # i.e. at the first subtoken of each original word
        doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels
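
For context, this is roughly how I build the encodings and call the function, following the tutorial (the model name, the toy data, and the tag2id mapping below are just stand-ins for my real dataset; on older transformers versions the flag is is_pretokenized instead of is_split_into_words):

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

# tiny stand-in for the real data: pre-tokenized words with per-word tags
train_texts = [['Hugging', 'Face', 'is', 'in', 'New', 'York']]
train_tags = [['B-ORG', 'I-ORG', 'O', 'O', 'B-LOC', 'I-LOC']]
tag2id = {'O': 0, 'B-ORG': 1, 'I-ORG': 2, 'B-LOC': 3, 'I-LOC': 4}

train_encodings = tokenizer(train_texts,
                            is_split_into_words=True,
                            return_offsets_mapping=True,
                            padding=True,
                            truncation=True)
train_labels = encode_tags(train_tags, train_encodings)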

It doesn't throw an error when I use BERT or DistilBERT as the pretrained model and tokenizer, but with some other model in their place, this is the error I get:

Traceback (most recent call last):
  File "huggingFace_NER.py", line 187, in <module>
    train_labels = encode_tags(train_tags, train_encodings)
  File "huggingFace_NER.py", line 70, in encode_tags
    doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
ValueError: NumPy boolean array indexing assignment cannot assign 100 input values to the 80 output values where the mask is true
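
In case it helps to see the mismatch directly, here is a small sketch that counts how many positions the mask selects per tokenizer (the model names are just examples, and note that RoBERTa's fast tokenizer needs add_prefix_space=True to accept pre-tokenized input):

import numpy as np
from transformers import AutoTokenizer

words = ['Hugging', 'Face', 'is', 'in', 'New', 'York']

for name in ['distilbert-base-cased', 'roberta-base', 'xlnet-base-cased']:
    # RoBERTa's byte-level tokenizer refuses word lists without a prefix space
    kwargs = {'add_prefix_space': True} if name == 'roberta-base' else {}
    tok = AutoTokenizer.from_pretrained(name, use_fast=True, **kwargs)
    enc = tok(words, is_split_into_words=True, return_offsets_mapping=True)
    offsets = np.array(enc['offset_mapping'])
    mask = (offsets[:, 0] == 0) & (offsets[:, 1] != 0)
    # whenever this count differs from len(words), the boolean
    # assignment in encode_tags raises exactly the ValueError above
    print(name, 'word labels:', len(words), 'masked positions:', mask.sum())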