Model inference on tokenized dataset

I have a trained PyTorch sequence classification model (1 label, 5 classes) and I’d like to apply it in batches to a dataset that has already been tokenized. I only need the predicted label, not the probability distribution.
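Because only the predicted class is needed, the softmax step can be skipped: softmax preserves ordering, so the index of the largest logit is already the predicted class. A minimal plain-Python sketch, assuming the logits for a batch are available as one list of five floats per example (the `predicted_labels` helper and the numbers are made up for illustration):

```python
def predicted_labels(logits):
    """Return the index of the largest logit for each example."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# Two made-up examples with 5 class scores each.
batch_logits = [
    [0.1, 2.3, -1.0, 0.4, 0.0],
    [-0.5, 0.2, 0.1, 3.1, 1.0],
]
labels = predicted_labels(batch_logits)
```

With PyTorch tensors the equivalent is `outputs.logits.argmax(dim=-1)`.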

I have spent several hours reviewing the HuggingFace documentation (Transformers, Datasets, Pipelines), the course, GitHub, and Discuss, and running Google searches, but I have not been able to find this anywhere, which is disappointing given that it seems like the most basic example that could be provided. There are dozens of examples of training, fine-tuning, and evaluation, but the only inference examples are applied to single texts at a time.

Is there a simple way to apply a HF model to a dataset? I would really encourage making this a prominent example in the documentation, as a demonstration of the HF technology.

Examples of pages reviewed:

I’m being restricted to only 2 links so I can’t include the others.

Code below shows model loading, dataset creation, and 4 different attempts at running inference on the dataset.

Model loading

import transformers
print("Transformers version:", transformers.__version__)

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, pipeline
model_path = str( / "models/deberta-v3")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# num_labels appears to mean num_classes
model = AutoModelForSequenceClassification.from_pretrained(model_path,
                                                           problem_type = "single_label_classification",
                                                           num_labels = 5)

Dataset, tokenization, dataloader

from datasets import Dataset
raw_dataset = Dataset.from_pandas(my_df[['id', 'text']])

# No label preprocessing - this is purely for inference.
def tokenize(batch):
    tokens = tokenizer(batch['text'], truncation = True, padding = True, max_length = 256)
    return tokens

# We lose the progress bar when parallelized :/
# Remove any extra columns to avoid a warning when training, not essential though.
tokenized_datasets = raw_dataset.map(tokenize, batched = True, num_proc = 6,
                                     remove_columns = raw_dataset.column_names)


import torch
from tqdm import tqdm
dataloader = torch.utils.data.DataLoader(tokenized_datasets, batch_size = 8)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

Four different ways of trying to apply the model to the dataset: 1) trainer, 2) dataloader explicitly moving batch to the device, 3) dataloader skipping the movement of the batch to device, 4) pipeline.

1. Trainer

trainer = Trainer(model)
predictions = trainer.predict(tokenized_datasets)

Result: ValueError: expected sequence of length 103 at dim 1 (got 108)
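This error is consistent with how `padding = True` behaves inside a batched `map`: each chunk handed to `tokenize` is padded to its own longest sequence, so rows from different chunks can end up with different lengths, and the collator then cannot stack them into one tensor. A minimal plain-Python sketch of the alternative, padding each batch at collate time (the `pad_batch` helper, the token ids, and `pad_id = 0` are assumptions for illustration; in Transformers, `DataCollatorWithPadding(tokenizer)` plays this role):

```python
def pad_batch(rows, pad_id=0):
    """Pad every row of token ids to the longest row in this batch."""
    longest = max(len(row) for row in rows)
    input_ids = [row + [pad_id] * (longest - len(row)) for row in rows]
    attention_mask = [[1] * len(row) + [0] * (longest - len(row)) for row in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids of unequal length, as produced when map() does not pad.
rows = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 102]]
batch = pad_batch(rows)
```

Applied to the code above, that would mean dropping `padding = True` from `tokenize` and passing `data_collator = DataCollatorWithPadding(tokenizer)` to `Trainer`.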

2. Dataloader w/ batch moved to GPU:

preds = []
for i, batch in enumerate(tqdm(dataloader)):
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)

Result: AttributeError: 'list' object has no attribute 'to'

3. Dataloader without batch moved to GPU.

preds = []
for i, batch in enumerate(tqdm(dataloader)):
    outputs = model(**batch)

Result: AttributeError: 'list' object has no attribute 'size'
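Both `AttributeError`s point to the same cause: the dataset returns plain Python lists, and PyTorch's default collate function handles a batch of lists by regrouping them position by position, so `batch['input_ids']` arrives as a list rather than a single tensor. A plain-Python imitation of that regrouping, for illustration only (`positionwise_collate` is a hypothetical stand-in, not the PyTorch function):

```python
def positionwise_collate(rows):
    """Group the i-th element of every example together, like zip(*rows)."""
    return [list(column) for column in zip(*rows)]

# Three made-up examples of four token ids each: shape (batch=3, seq_len=4).
rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
collated = positionwise_collate(rows)
# The result has one entry per token position, not one per example.
```

Calling `tokenized_datasets.set_format("torch")` before building the `DataLoader` should make the dataset return tensors, which then collate into one `(batch, seq_len)` tensor per key, provided all rows in a batch have the same length.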

4. Pipeline

# device = 0 puts the pipeline on GPU.
pipe = pipeline("text-classification", model = model, tokenizer = tokenizer, device = 0)

# Skip tokenization, since the pipeline wants to do that automatically.
pipe_dataloader = torch.utils.data.DataLoader(raw_dataset, batch_size = 8)

preds = []
for i, batch in enumerate(tqdm(pipe_dataloader)):
    outputs = pipe(batch)

Well, it doesn’t seem like I can edit the original post anymore, but after testing 4 additional variants, here is a version that I believe is working (it will be another 10 hours before it finishes and I can confirm):

# device = 0 puts the pipeline on GPU, otherwise it will only use CPU.
pipe = pipeline("text-classification", model = model, tokenizer = tokenizer, device = 0)

# Hide the large number of deprecation warnings.
import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning)

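from transformers.pipelines.pt_utils import KeyDataset

preds = []
# GPU RAM usage continues to grow through inference :( Something is not being deleted correctly.
for i, outputs in enumerate(tqdm(pipe(KeyDataset(raw_dataset, "text"), batch_size = 128),
                                 total = len(raw_dataset))):
    # Loop body was missing from the post; collecting each result is assumed.
    preds.append(outputs)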

I haven’t been able to get a version working using the pretokenized version of the dataset. It’s also unfortunate that GPU RAM usage grows over time, because I have to choose a batch size that leaves enough RAM free to avoid an out-of-memory failure near the end of the loop. I don’t know whether this is because outputs needs to be explicitly deleted from GPU RAM or something else.
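Since only the class index is wanted, each pipeline result can be reduced to an integer as it arrives. A minimal sketch, assuming the model config has no custom `id2label` mapping, in which case the `text-classification` pipeline reports labels such as `LABEL_3` (the `label_to_id` helper and the sample results are hypothetical):

```python
def label_to_id(output):
    """Convert {'label': 'LABEL_3', 'score': 0.98} to the integer 3."""
    if isinstance(output, list):  # tolerate a result wrapped in a list
        output = output[0]
    return int(output["label"].rsplit("_", 1)[1])

# Made-up results in the shape the pipeline yields per input text.
outputs = [{"label": "LABEL_3", "score": 0.98}, {"label": "LABEL_0", "score": 0.61}]
preds = [label_to_id(o) for o in outputs]
```

Inside the loop above, this would replace storing the full result with `preds.append(label_to_id(outputs))`.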
