I have a trained PyTorch sequence classification model (single-label, 5 classes) and I'd like to apply it in batches to a dataset that has already been tokenized. I only need the predicted label, not the full probability distribution.
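(By "predicted label" I mean the class id from the argmax over the output logits, i.e. logits.argmax(dim = -1) in PyTorch, rather than the softmax probabilities.)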
I have spent several hours reviewing the HuggingFace documentation (Transformers, Datasets, Pipelines), the course, GitHub, the Discuss forum, and Google, and it has been disappointing not to find this anywhere - it seems like the most basic example that could be provided. There are dozens of examples of training, fine-tuning, and evaluation, but the only inference examples I can find apply the model to a single text at a time.
Is there some simple way to apply an HF model to a dataset? I would really encourage making this a prominent example demonstrating the HF technology.
Examples of pages reviewed:
- GitHub - huggingface/datasets: 🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
- HuggingFace datasets library - Overview
I’m being restricted to only 2 links so I can’t include the others.
Code below shows model loading, dataset creation, and four different attempts at running inference on the dataset.
Model loading
import torch
import pyprojroot
import transformers
print("Transformers version:", transformers.__version__)
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, pipeline

model_path = str(pyprojroot.here() / "models/deberta-v3")
tokenizer = AutoTokenizer.from_pretrained(model_path)
# num_labels appears to mean num_classes.
model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    problem_type = "single_label_classification",
    num_labels = 5)
Dataset, tokenization, dataloader
from datasets import Dataset

# my_df is a pandas DataFrame with 'id' and 'text' columns.
raw_dataset = Dataset.from_pandas(my_df[['id', 'text']])

# No label preprocessing - this is purely for inference.
def tokenize(batch):
    tokens = tokenizer(batch['text'], truncation = True, padding = True, max_length = 256)
    return tokens

# We lose the progress bar when parallelized :/
tokenized_datasets = raw_dataset.map(tokenize, batched = True, num_proc = 6,
                                     # Remove any extra columns to avoid a warning; not essential though.
                                     remove_columns = raw_dataset.column_names)
tokenized_datasets.set_format('torch')

from tqdm import tqdm
dataloader = torch.utils.data.DataLoader(tokenized_datasets, batch_size = 8)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
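My suspicion about the failures below: padding = True inside map pads each map chunk (1,000 rows by default) to that chunk's longest sequence, so rows from different chunks end up with different lengths. A variant I'm considering but haven't verified, which tokenizes without padding and lets DataCollatorWithPadding pad each dataloader batch dynamically:

from transformers import DataCollatorWithPadding

# Tokenize without padding; the collator pads each batch to its own max length.
def tokenize_no_pad(batch):
    return tokenizer(batch['text'], truncation = True, max_length = 256)

tokenized_nopad = raw_dataset.map(tokenize_no_pad, batched = True,
                                  remove_columns = raw_dataset.column_names)

collator = DataCollatorWithPadding(tokenizer)
dataloader = torch.utils.data.DataLoader(tokenized_nopad, batch_size = 8,
                                         collate_fn = collator)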
Four different ways of trying to apply the model to the dataset: 1) Trainer, 2) a dataloader with each batch explicitly moved to the device, 3) a dataloader without moving the batch to the device, 4) a pipeline.
1. Trainer
trainer = Trainer(model)
predictions = trainer.predict(tokenized_datasets)
Result: ValueError: expected sequence of length 103 at dim 1 (got 108)
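If per-chunk padding in map is indeed the cause, I'd guess the Trainer needs a collator that re-pads each batch. An untested sketch, reusing tokenized_nopad and DataCollatorWithPadding from above:

trainer = Trainer(model, data_collator = DataCollatorWithPadding(tokenizer))
predictions = trainer.predict(tokenized_nopad)
# predictions.predictions holds the logits as a NumPy array; argmax gives class ids.
pred_labels = predictions.predictions.argmax(axis = -1)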
2. Dataloader w/ batch moved to GPU:
preds = []
for i, batch in enumerate(tqdm(dataloader)):
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    preds.append(outputs)
Result: AttributeError: 'list' object has no attribute 'to' - so the collated batch values are plain Python lists, not tensors.
3. Dataloader without batch moved to GPU.
preds = []
for i, batch in enumerate(tqdm(dataloader)):
    outputs = model(**batch)
    preds.append(outputs)
Result: AttributeError: 'list' object has no attribute 'size' - presumably the same root cause: the model's forward receives lists where it expects tensors.
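For reference, this is the loop I would expect to work once batches collate into equal-length tensors (again assuming the DataCollatorWithPadding dataloader sketched above):

model.eval()
pred_labels = []
with torch.no_grad():
    for batch in tqdm(dataloader):
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = model(**batch).logits
        # Keep only the predicted class id per example, not the distribution.
        pred_labels.extend(logits.argmax(dim = -1).cpu().tolist())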
4. Pipeline
# device = 0 puts the pipeline on GPU.
pipe = pipeline("text-classification", model = model, tokenizer = tokenizer, device = 0)

# Skip tokenization, since the pipeline wants to do that automatically.
pipe_dataloader = torch.utils.data.DataLoader(raw_dataset, batch_size = 8)

preds = []
for i, batch in enumerate(tqdm(pipe_dataloader)):
    outputs = pipe(batch)
    preds.append(outputs)
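Related: the pipeline docs suggest a pipeline can consume a dataset directly via KeyDataset (from transformers.pipelines.pt_utils), which would avoid the manual dataloader entirely. I haven't verified this end to end:

from transformers.pipelines.pt_utils import KeyDataset

pred_labels = []
# The pipeline tokenizes and batches internally; KeyDataset yields the 'text' column.
for out in tqdm(pipe(KeyDataset(raw_dataset, "text"), batch_size = 8)):
    # Each result looks like {'label': 'LABEL_3', 'score': 0.91}; keep only the label.
    pred_labels.append(out['label'])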