AutoModelForSequenceClassification takes more than 20 minutes to classify a single sequence

I am using an RTX 4090. I would imagine that sequence classification would be rather fast on it, but apparently not. It takes MORE than 20 minutes to run on a single sequence of 4k tokens (which may or may not contain padding tokens), so calling the predict function on a single row of data takes more than 40 minutes. It would take forever to run on the test dataset I have.

I am using the code below. The model is Mistral 7B.

import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "my trained seq classification model based on Mistral"
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=1,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def predict(row):
    prompt = row.question
    chosen = f"Question: {prompt}\nAnswer: {row.chosen.strip()}"
    rejected = f"Question: {prompt}\nAnswer: {row.rejected.strip()}"
    with torch.no_grad():
        rewards_chosen = model(
            **tokenizer(chosen, return_tensors='pt')
        ).logits
        print('reward chosen is ', rewards_chosen)

        rewards_rejected = model(
            **tokenizer(rejected, return_tensors='pt')
        ).logits

        print('reward rejected is ', rewards_rejected)
        loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
        print(f"loss is {loss}")
        return (rewards_chosen.item(), rewards_rejected.item(), loss, rewards_chosen>rewards_rejected)

What could be the reason that inference on a single sequence takes so long, and how do I fix it?

Hmm, it seems like you are not using the GPU for the calculations. Can you try adding .to("cuda")?
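A quick way to confirm where the weights actually live (a small check using the model object you already loaded):

import torch

print(torch.cuda.is_available())        # should be True on an RTX 4090 with a CUDA build of PyTorch
print(next(model.parameters()).device)  # device the model weights are on; "cpu" would explain the slowdown

If that prints cpu, move both the model and the tokenized inputs onto the GPU, for example: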

import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "my trained seq classification model based on Mistral"
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=1,
).to("cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)

def predict(row):
    prompt = row.question
    chosen = f"Question: {prompt}\nAnswer: {row.chosen.strip()}"
    rejected = f"Question: {prompt}\nAnswer: {row.rejected.strip()}"
    with torch.no_grad():
        inputs_chosen = tokenizer(chosen, return_tensors='pt').to("cuda")
        rewards_chosen = model(**inputs_chosen).logits
        print('reward chosen is ', rewards_chosen)

        inputs_rejected = tokenizer(rejected, return_tensors='pt').to("cuda")
        rewards_rejected = model(**inputs_rejected).logits

        print('reward rejected is ', rewards_rejected)
        loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
        print(f"loss is {loss}")
        return (rewards_chosen.item(), rewards_rejected.item(), loss, rewards_chosen>rewards_rejected)
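One caveat (my assumption, not something from your post): a 7B model in full float32 precision needs roughly 28 GB for the weights alone, which is more than the 24 GB of an RTX 4090, so the plain .to("cuda") above can fail with an out-of-memory error. Loading the weights in half precision is one way around that, for example:

import torch
from transformers import AutoModelForSequenceClassification

# Sketch: float16 weights (~14 GB for a 7B model) fit on a 24 GB card.
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=1,
    torch_dtype=torch.float16,
).to("cuda:0")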

Yeah, your suggestion helped me fix it. .to("cuda") on its own didn't work, but I used quantization, which I think forced the model onto the GPU, and then I also moved the inputs to CUDA, and it was quite fast afterwards.
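For reference, a typical 4-bit quantized load with transformers and bitsandbytes looks something like this (a sketch with assumed settings, not my exact config):

import torch
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# Sketch: 4-bit quantization keeps the 7B model well under 24 GB,
# and device_map places the quantized weights on the GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=1,
    quantization_config=bnb_config,
    device_map="auto",
)

The inputs from the tokenizer still need .to("cuda") as in the snippet above.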
