Model inference on tokenized dataset

ck37 · February 17, 2022, 12:28pm

I have a trained PyTorch sequence classification model (1 label, 5 classes) and I’d like to apply it in batches to a dataset that has already been tokenized. I only need the predicted label, not the probability distribution.

I have spent several hours reviewing the HuggingFace documentation (Transformers, Datasets, Pipelines), the course, GitHub, Discuss, and Google search results, and it has been disappointing not to find this anywhere, since it seems like the most basic example that could be provided. There are dozens of examples of training, fine-tuning, and evaluation, but the only inference examples apply a model to a single text at a time.

Is there some simple way to apply a HF model to a dataset? I would really encourage that this be a prominent example provided as a demonstration of the HF technology.

Examples of pages reviewed:

I’m being restricted to only 2 links so I can’t include the others.

Code below shows model loading, dataset creation, and 4 different attempts at running inference on the dataset.

Model loading

import pyprojroot
import transformers
print("Transformers version:", transformers.__version__)

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, pipeline
model_path = str(pyprojroot.here() / "models/deberta-v3")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# num_labels is the number of classes (5 here), not the number of label columns.
model = AutoModelForSequenceClassification.from_pretrained(model_path,
                                                           problem_type = "single_label_classification",
                                                           num_labels = 5)

Dataset, tokenization, dataloader

from datasets import Dataset
raw_dataset = Dataset.from_pandas(my_df[['id', 'text']])

# No label preprocessing - this is purely for inference.
def tokenize(batch):
    tokens = tokenizer(batch['text'], truncation = True, padding = True, max_length = 256)
    return tokens

# We lose the progress bar when parallelized :/
# Removing the extra columns avoids a warning when training; not essential though.
tokenized_datasets = raw_dataset.map(tokenize, batched = True, num_proc = 6,
                                     remove_columns = raw_dataset.column_names)

tokenized_datasets.set_format('torch')

import torch
from tqdm import tqdm
dataloader = torch.utils.data.DataLoader(tokenized_datasets, batch_size = 8)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

Four different ways of trying to apply the model to the dataset: 1) trainer, 2) dataloader explicitly moving batch to the device, 3) dataloader skipping the movement of the batch to device, 4) pipeline.

1. Trainer

trainer = Trainer(model)
predictions = trainer.predict(tokenized_datasets)

Result: ValueError: expected sequence of length 103 at dim 1 (got 108)
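
A likely explanation for this error, offered as an assumption rather than a confirmed diagnosis: `padding = True` inside `tokenize` pads each `map` chunk (1000 rows by default) to the longest text in that chunk, so rows from different chunks can have different lengths, such as 103 and 108. `Trainer(model)` without a tokenizer then uses a collator that does not pad. Passing `data_collator = DataCollatorWithPadding(tokenizer)` to `Trainer`, or as `collate_fn` to the `DataLoader`, pads each batch on the fly. The pure-Python sketch below only illustrates what such per-batch padding does; `pad_batch` and the pad id of 0 are hypothetical stand-ins, not library code.

```python
from typing import Dict, List


def pad_batch(rows: List[Dict[str, List[int]]],
              pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every row of one batch to the longest row in that batch."""
    max_len = max(len(row["input_ids"]) for row in rows)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for row in rows:
        n_pad = max_len - len(row["input_ids"])
        # Padding positions get the pad id and a 0 in the attention mask.
        batch["input_ids"].append(row["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(row["attention_mask"] + [0] * n_pad)
    return batch


# Two rows of different length, as can happen when they come from different map() chunks.
rows = [
    {"input_ids": [1, 7, 8, 2], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [1, 9, 2], "attention_mask": [1, 1, 1]},
]
batch = pad_batch(rows)
print(batch["input_ids"])
print(batch["attention_mask"])
```

With per-batch padding in place, the `padding = True` argument in `tokenize` is no longer needed, and every batch becomes rectangular regardless of which chunk its rows came from.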

2. Dataloader w/ batch moved to GPU:

preds = []
for i, batch in enumerate(tqdm(dataloader)):
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    preds.append(outputs)

Result: AttributeError: 'list' object has no attribute 'to'

3. Dataloader without batch moved to GPU.

preds = []
for i, batch in enumerate(tqdm(dataloader)):
    outputs = model(**batch)
    preds.append(outputs)

Result: AttributeError: 'list' object has no attribute 'size'

4. Pipeline

# device = 0 puts the pipeline on GPU.
pipe = pipeline("text-classification", model = model, tokenizer = tokenizer, device = 0)

# Skip tokenization, since the pipeline wants to do that automatically.
pipe_dataloader = torch.utils.data.DataLoader(raw_dataset, batch_size = 8)

preds = []
for i, batch in enumerate(tqdm(pipe_dataloader)):
    outputs = pipe(batch)
    preds.append(outputs)

ck37 · February 18, 2022, 1:32pm

Well, it doesn’t seem like I can edit the original post anymore, but after testing 4 additional variants, here is a version that I believe is working (it will be another 10 hours before it finishes and I can confirm):

import warnings
from transformers.pipelines.pt_utils import KeyDataset

# device = 0 puts the pipeline on GPU, otherwise it will only use CPU.
pipe = pipeline("text-classification", model = model, tokenizer = tokenizer, device = 0)

# Hide the large number of deprecation warnings.
warnings.filterwarnings("ignore", category = DeprecationWarning)

preds = []
# GPU RAM usage continues to grow through inference :( Something is not being deleted correctly.
for i, outputs in enumerate(tqdm(pipe(KeyDataset(raw_dataset, "text"), batch_size = 128),
                                 total = len(raw_dataset))):
    preds.append(outputs)

I haven’t been able to get a version working using the pretokenized version of the dataset. It’s also unfortunate that GPU RAM usage grows over time, because I have to make sure the batch size keeps enough RAM available to not fail due to out-of-memory near the end of the loop. I don’t know if this could be because outputs needs to be explicitly deleted from GPU RAM or something.
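
Since only the predicted label is needed, the pipeline results can be reduced to class ids afterwards. This is a minimal sketch, assuming each element of `preds` is a dict such as `{"label": "LABEL_3", "score": 0.91}` (the default naming when the model config has no custom label names). The sample values and the helper `label_to_class_id` are hypothetical.

```python
def label_to_class_id(label: str, prefix: str = "LABEL_") -> int:
    """Turn a default label name such as 'LABEL_3' into the integer 3."""
    if not label.startswith(prefix):
        raise ValueError(f"Unexpected label format: {label!r}")
    return int(label[len(prefix):])


# Hypothetical pipeline outputs, one dict per input text.
preds = [
    {"label": "LABEL_3", "score": 0.91},
    {"label": "LABEL_0", "score": 0.55},
    {"label": "LABEL_4", "score": 0.78},
]

# Keep only the class id and drop the score.
class_ids = [label_to_class_id(p["label"]) for p in preds]
print(class_ids)
```

If the model config defines custom label names, `model.config.label2id` is the mapping to use in place of the prefix parsing.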

duyduong9htv · March 22, 2023, 7:31pm

I assume you have figured this out by now, but these are the 2 methods that worked for me:

# padding=True keeps this working when the texts tokenize to different lengths.
inputs = tokenizer(['hello is it me ', 'you are looking for'], padding=True, return_tensors='pt')
preds = model(**inputs)
preds

-- 

SequenceClassifierOutput(loss=None, logits=tensor([[0.0244, 0.0353],
        [0.0049, 0.0681]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

From tokenized dataset:

# Only input_ids are passed, so the rows must already share one length and any padding is treated as real input.
inputs = torch.tensor(tokenized_dataset['train']['input_ids'][:4])
preds = model(inputs.cuda())
preds
--

SequenceClassifierOutput(loss=None, logits=tensor([[ 2.3646, -1.9984],
        [ 3.9254, -3.2606],
        [ 2.9887, -2.5912],
        [ 0.5792, -0.3113]], device='cuda:0', grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
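
To go from these logits to predicted labels, take the index of the largest logit in each row. On the tensor itself that is `preds.logits.argmax(dim=-1)`. The `grad_fn=<AddmmBackward0>` in the output also shows that autograd is recording the forward pass, so wrapping the call in `with torch.no_grad():` is worth trying for inference. The sketch below does the same row-wise argmax in plain Python on the logits printed above; `argmax_rows` is a hypothetical helper.

```python
from typing import List


def argmax_rows(logits: List[List[float]]) -> List[int]:
    """Return, for each row, the index of its largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Logits copied from the second output above (4 texts, 2 classes).
logits = [
    [2.3646, -1.9984],
    [3.9254, -3.2606],
    [2.9887, -2.5912],
    [0.5792, -0.3113],
]
predicted = argmax_rows(logits)
print(predicted)
```

No softmax is needed for this step, because softmax preserves the ordering of the logits within a row.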
