Tutorial: Fine-tuning with custom datasets – sentiment, NER, and question answering

Interested in fine-tuning on your own custom datasets but unsure how to get going? I just added a tutorial to the docs with several examples that each walk you through downloading a dataset, preprocessing & tokenizing, and training with either Trainer, native PyTorch, or native TensorFlow 2.

Examples include:

  1. Sequence classification (sentiment) – IMDb
  2. Token classification (NER) – W-NUT Emerging and Rare entities
  3. Question answering (span selection) – SQuAD 2.0

Click the Open in Colab button at the top to open a Colab notebook in either TF or PT. This tutorial demonstrates one workflow for working with custom datasets, but there are many valid ways to accomplish the same thing. The intention is to be demonstrative rather than definitive. Also, we highly recommend you check out and contribute to our NLP datasets & metrics library for easy access to 150+ datasets.

Tutorial: https://huggingface.co/transformers/master/custom_datasets.html

Feedback and questions welcome!


I spotted a minor typo. “…which we can use for for evaluation and tuning without taining our test set results.” I believe you meant to say tainting.

Otherwise, great tutorial. I’m looking forward to digging in more.


Thanks for contributing!
May I ask: if I don't have any labels or relationships between sentences, could I fine-tune a BERT model with the masked language modeling task?

Of course! At the bottom of the tutorial we actually link to a blog post that shows you how to do just that https://huggingface.co/blog/how-to-train
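To make concrete why masked language modeling needs no labels, here is a minimal pure-Python sketch of how training targets can be derived from the text itself. It is a simplification, not the procedure from the linked blog post: the full BERT scheme also replaces some selected tokens with random tokens or leaves them unchanged, and in transformers this step is normally handled by `DataCollatorForLanguageModeling`. The function name `mask_tokens`, the seed, and the toy token ids are assumptions for illustration.

```python
import random

MASK_ID = 103        # id of [MASK] in the bert-base-uncased vocabulary
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_tokens(input_ids, mask_prob=0.15, seed=0):
    """Simplified masking: hide a random subset of tokens and keep the
    originals as labels. Unmasked positions get IGNORE_INDEX."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token_id in input_ids:
        if rng.random() < mask_prob:
            masked.append(MASK_ID)   # the model must predict this token
            labels.append(token_id)  # the target is the original token
        else:
            masked.append(token_id)
            labels.append(IGNORE_INDEX)
    return masked, labels


# Hypothetical token ids standing in for one tokenized sentence.
input_ids = [2023, 3185, 2001, 2428, 2204, 1998, 4569, 2000, 3422, 1012]
masked, labels = mask_tokens(input_ids, mask_prob=0.3)
print(masked)
print(labels)
```

The labels come entirely from the unlabeled input, which is why this objective works on raw text.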

Thanks a lot for creating the tutorial @joeddav!

I ran into an issue with the tokenizer. It seems like I cannot just pass my list of texts to the tokenizer like in the tutorial. Am I doing something wrong?

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(test_texts, truncation=True, padding=True)
TypeError                                 Traceback (most recent call last)
<ipython-input-77-d270a8ea6e50> in <module>
----> 1 train_encodings = tokenizer(train_texts, truncation=True, padding=True)
  2 val_encodings = tokenizer(test_texts, truncation=True, padding=True)

TypeError: 'DistilBertTokenizerFast' object is not callable

Hi stefan-jo

What version of transformers are you using? You might need version 3. See this issue https://github.com/huggingface/transformers/issues/5931 , which says that transformers 2.3.0 does not have callable tokenizers.
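A quick way to check this before debugging further is to read the installed version. The sketch below uses only the standard library; the helper names `major_version` and `tokenizer_is_callable` are made up for illustration, and the threshold of 3 is taken from the reply above rather than from a changelog.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '2.3.0'."""
    return int(version_string.split(".")[0])


def tokenizer_is_callable(version_string):
    # Assumption from the reply above: version 3 or later is needed
    # for tokenizer(texts, ...) to work.
    return major_version(version_string) >= 3


try:
    installed = metadata.version("transformers")
    print(installed, "-> callable tokenizers:", tokenizer_is_callable(installed))
except metadata.PackageNotFoundError:
    print("transformers is not installed in this environment")
```

In a notebook, remember to restart the kernel after upgrading so the new version is actually imported.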


Yes, that was it. I installed from source, restarted the kernel and now it’s working :slight_smile:

Thanks @rgwatwormhill


I have two (very basic) questions:

  1. I suppose in the tutorial the entire model is being fine-tuned at once. Is there an easy way to first train only the classification head and only then unfreeze the entire model?
  2. Is the classification head in BertForSequenceClassification pre-trained or initialized randomly on top of BertModel? If pre-trained, which task/dataset has been used for pre-training?

Note: I’ve been using BERT instead of DistilBERT, but I guess the same applies to both.

For 1, you can look in the training tutorial, where there is an example in PyTorch.
For 2, the head is initialized randomly, since we are using a checkpoint of the base model. It would be pretrained if we used a checkpoint that has been fine-tuned for sequence classification, like distilbert-base-uncased-finetuned-sst-2-english.
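For the two-stage idea in question 1, a minimal PyTorch sketch is shown below. It is not the code from the training tutorial: `TinyClassifier` is a hypothetical stand-in for a pretrained encoder plus a randomly initialized head, and `set_trainable` is a helper invented here. With a real transformers model, the encoder is usually reachable as `model.base_model`, so the same helper could be applied to that attribute.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in: 'base' plays the pretrained encoder,
    'classifier' plays the randomly initialized head."""

    def __init__(self, hidden=8, num_labels=2):
        super().__init__()
        self.base = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(torch.relu(self.base(x)))


def set_trainable(module, trainable):
    """Freeze or unfreeze every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable


model = TinyClassifier()

# Stage 1: freeze the base so only the head receives gradient updates.
set_trainable(model.base, False)
trainable_stage1 = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable_stage1)

# Stage 2: unfreeze the base and continue training the whole model.
set_trainable(model.base, True)
trainable_stage2 = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable_stage2)
```

When switching stages, create the optimizer (or Trainer) after changing `requires_grad`, so that it picks up the intended set of parameters.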


Thank you for your answer! I’ll check out the tutorial.

Hi @sgugger, @joeddav. When I use the Trainer class for the NER task together with a compute_metrics function like this one:

# def compute_metrics(pred):
#     labels = pred.label_ids
#     preds = pred.predictions.argmax(-1)

#     print(labels.shape == preds.shape)

#     precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
#     # acc = accuracy_score(labels, preds)
#     return {
#         'f1': f1,
#         'precision': precision,
#         'recall': recall
#     }

I always run into this error:
ValueError: Classification metrics can't handle a mix of multiclass-multioutput and multilabel-indicator targets
Solutions online suggest that the labels and predictions might not have the same shape, but that is not the case here. The rest of the code is exactly like the tutorial.

Hi there. You should use metrics designed for NER. For instance the package seqeval has some that will work directly. Check the run_ner script to see how it’s used in compute_metrics.
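As a rough illustration of what that preprocessing involves: for token classification, both the predictions (after argmax) and the labels are 2D, one row per sentence, and the labels contain -100 at positions that should be ignored. The sketch below is not taken from run_ner; it is a pure-Python approximation with a made-up helper name (`to_tag_sequences`) and toy data. It drops the ignored positions and maps ids to tag strings, producing the list-of-lists format that seqeval's metrics expect.

```python
IGNORE_INDEX = -100  # label given to sub-tokens, special tokens and padding


def to_tag_sequences(pred_ids, label_ids, id2tag):
    """Drop ignored positions and map ids to tag strings, per sentence."""
    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != IGNORE_INDEX]
        pred_tags.append([id2tag[p] for p, _ in kept])
        true_tags.append([id2tag[l] for _, l in kept])
    return true_tags, pred_tags


# Hypothetical toy data: 2 sentences, 5 token positions each.
id2tag = {0: "O", 1: "B-location", 2: "I-location"}
label_ids = [[-100, 1, 2, 0, -100], [-100, 0, -100, 0, -100]]
pred_ids = [[0, 1, 0, 0, 0], [0, 0, 1, 1, 0]]

true_tags, pred_tags = to_tag_sequences(pred_ids, label_ids, id2tag)
print(true_tags)
print(pred_tags)
```

Inside compute_metrics, `pred.predictions.argmax(-1).tolist()` and `pred.label_ids.tolist()` would supply the nested lists, and the two results could then be passed to a seqeval metric such as `f1_score(true_tags, pred_tags)`.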


Thanks for this tutorial @joeddav. I have reviewed your W-NUT example a few times.
I was wondering if you could point me to a similar example that demonstrates how to add new labels to the classification. For example, I would like to classify address information.


Hello everyone,
I’m trying to reproduce the IMDb sentiment analysis model from the tutorial. I already had the data on my personal machine in a slightly different form, but after transforming it into a list of strings, each one containing a review, it should be the same. But I get an error. I don’t know if I did something wrong or if the library changed since the tutorial was created.
My code looks like this (the script isn’t finished as it doesn’t contain the evaluation part but the training is already failing):

import os
import pandas as pd
import pickle
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification, Trainer, TrainingArguments
import torch
from torch import cuda
from torch.utils.data import Dataset
import time
from transformers.integrations import TensorBoardCallback

from train_and_eval_lstm import print_evaluation_scores

device = 'cuda' if cuda.is_available() else 'cpu'

def main():
    clean_text_train_fn = os.path.join(os.getcwd(), "Transformed_data/clean_text_train.csv")
    df_clean_text_train = pd.read_csv(clean_text_train_fn)
    clean_text_train = df_clean_text_train["clean_text"].tolist()
    clean_text_valid_fn = os.path.join(os.getcwd(), "Transformed_data/clean_text_valid.csv")
    df_clean_text_valid = pd.read_csv(clean_text_valid_fn)
    clean_text_valid = df_clean_text_valid["clean_text"].tolist()
    clean_text_test_fn = os.path.join(os.getcwd(), "Transformed_data/clean_text_test.csv")
    df_clean_text_test = pd.read_csv(clean_text_test_fn)
    clean_text_test = df_clean_text_test["clean_text"].tolist()

    ## Load binary labels
    y_binary_train_fn = os.path.join(os.getcwd(), 'Transformed_data/Labels/y_binary_train.pkl')
    with open(y_binary_train_fn, mode='rb') as f:
        y_binary_train = pickle.load(f)
    y_binary_valid_fn = os.path.join(os.getcwd(), 'Transformed_data/Labels/y_binary_valid.pkl')
    with open(y_binary_valid_fn, mode='rb') as f:
        y_binary_valid = pickle.load(f)
    y_binary_test_fn = os.path.join(os.getcwd(), 'Transformed_data/Labels/y_binary_test.pkl')
    with open(y_binary_test_fn, mode='rb') as f:
        y_binary_test = pickle.load(f)

    ## Using pretrained Tokenizer
    model_name = 'distilbert-base-uncased'
    tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

    start = time.time()
    train_encodings = tokenizer(clean_text_train, truncation=True, padding=True)
    stop = time.time()
    print(f"Time to tokenize training set: {stop - start}")
    val_encodings = tokenizer(clean_text_valid, truncation=True, padding=True)
    test_encodings = tokenizer(clean_text_test, truncation=True, padding=True)

    class IMDbDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx])
            return item

        def __len__(self):
            return len(self.labels)

    n_toy = 500
    toy_dataset = IMDbDataset(train_encodings[:n_toy], y_binary_train[:n_toy])
    train_dataset = IMDbDataset(train_encodings, y_binary_train)
    val_dataset = IMDbDataset(val_encodings, y_binary_valid)
    test_dataset = IMDbDataset(test_encodings, y_binary_test)

    training_args = TrainingArguments(
        output_dir='./results',  # output directory
        num_train_epochs=1,  # total number of training epochs
        per_device_train_batch_size=16,  # batch size per device during training
        per_device_eval_batch_size=64,  # batch size for evaluation
        warmup_steps=500,  # number of warmup steps for learning rate scheduler
        weight_decay=0.01,  # strength of weight decay
        logging_dir='./logs',  # directory for storing logs
    )

    model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

    trainer = Trainer(
        model=model,  # the instantiated 🤗 Transformers model to be trained
        args=training_args,  # training arguments, defined above
        train_dataset=toy_dataset,  # training dataset
        eval_dataset=val_dataset,  # evaluation dataset
    )

    start = time.time()
    trainer.train()
    stop = time.time()

    print(f"Time to train the model: {stop-start}")

    model_dir = os.path.join(os.getcwd(), "Saved_models")

if __name__ == "__main__":
    main()

And if I execute it, I receive the following error:

  File "/home/me/Documents/CS_Programming_Machine_Learning/Projects/IMDB_sentiment_analysis_2/Comparison_models/train_and_eval_DistilBERT.py", line 61, in __getitem__
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
AttributeError: 'list' object has no attribute 'items'

If I use debugging, I see that indeed, self.encodings is a python list.

I guess that I can fix it on my own, but I was wondering if I did something wrong or if the docs are outdated (I use version 4.5.1 of HF Transformers).

The problem lies in your added line:

toy_dataset = IMDbDataset(train_encodings[:n_toy], y_binary_train[:n_toy])

train_encodings is a dictionary (with some added properties that let you take a slice like this), so you should do something like

toy_encodings = {k: v[:n_toy] for k, v in train_encodings.items()}

to keep a dictionary.
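To see the difference without loading a tokenizer, here is a small sketch that uses a plain dict as a hypothetical stand-in for the tokenizer output. The token ids and labels are toy values. Each field is sliced separately, so the result is still a mapping with an `.items()` method, which is what the dataset's `__getitem__` relies on.

```python
# Hypothetical stand-in for tokenizer output: field name -> per-example lists.
train_encodings = {
    "input_ids": [[101, 2023, 102], [101, 2307, 102], [101, 2919, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}
train_labels = [1, 1, 0]
n_toy = 2

# Slice every field separately so the result is still a dictionary.
toy_encodings = {k: v[:n_toy] for k, v in train_encodings.items()}
toy_labels = train_labels[:n_toy]

# Every field must have one entry per label.
assert all(len(v) == len(toy_labels) for v in toy_encodings.values())
print(sorted(toy_encodings))
print(len(toy_labels))
```

The toy dataset can then be built from `toy_encodings` and `toy_labels` in the same way as the full one.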


Thank you very much for your answer sgugger.
Your suggestion solved my problem!
I’m sorry I didn’t realize the source of this issue myself :sweat_smile:

Have a nice day!


@joeddav, could you please update the tutorial link? I’m getting 404 error. Thank you :slight_smile:

@joeddav Thank you for the tutorial. I was trying to replicate the fine-tuning code with a different dataset and it worked. But when I changed the pretrained model from DistilBERT to something else, like RoBERTa or XLNet, I got an error in the encoding function.

This is the encoding function:

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # create an empty array of -100
        doc_enc_labels = np.ones(len(doc_offset),dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # set labels whose first offset position is 0 and the second is not 0
        doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())

    return encoded_labels

It doesn’t throw an error if I use BERT or DistilBERT as the pretrained model and tokenizer, but if I use another model in its place, this is the error I get:

Traceback (most recent call last):
  File "huggingFace_NER.py", line 187, in <module>
    train_labels = encode_tags(train_tags, train_encodings)
  File "huggingFace_NER.py", line 70, in encode_tags
    doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
ValueError: NumPy boolean array indexing assignment cannot assign 100 input values to the 80 output values where the mask is true
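One possible cause, offered as a guess rather than a confirmed diagnosis: the offset test in encode_tags assumes that exactly one sub-token per word has an offset starting at 0. That may not hold for tokenizers that split text differently from BERT's WordPiece, or when truncation drops words, and then the number of positions where the mask is true no longer matches the number of labels. An alternative that does not depend on offsets is to align labels through word indices, which fast tokenizers expose via `encodings.word_ids(batch_index=i)`. The sketch below is not the tutorial's method; `align_labels` and the toy inputs are assumptions for illustration.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids, word_labels):
    """Label the first sub-token of each word; ignore everything else.

    word_ids holds one entry per token: the index of the word it came
    from, or None for special tokens.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)  # special token or later sub-token
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical example: 3 words split into 1-3 sub-tokens each,
# surrounded by special tokens (None).
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
word_labels = [5, 0, 3]
print(align_labels(word_ids, word_labels))  # [-100, 5, -100, 0, 3, -100, -100, -100]
```

Because words removed by truncation never appear in `word_ids`, their labels are simply skipped instead of causing a length mismatch.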