AutoModelForCausalLM and OpenVINO

Following the Optimization documentation, I would like to quantize an AutoModelForCausalLM such as gpt2 with OpenVINO.

First I got an error saying that the text-generation task is not supported. I also tried quantizer = OVQuantizer.from_pretrained(model, feature='causal-lm'), but I get other errors.
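For reference, here is roughly what I ran (a minimal sketch; the gpt2 loading line is my assumption, since I did not paste my exact script):

from transformers import AutoModelForCausalLM
from optimum.intel.openvino import OVQuantizer

# Assumption: loading gpt2 as a causal LM, as in my snippet below
model = AutoModelForCausalLM.from_pretrained("gpt2")

# This is the call that fails for me
quantizer = OVQuantizer.from_pretrained(model, feature="causal-lm")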

Are AutoModelForCausalLM models supported?

I saw that I can accelerate a model at inference time using:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from optimum.intel.openvino import OVModelForCausalLM

import time

save_dir = "gpt2"

t = time.time()
# Plain CPU (PyTorch):
# ov_model = AutoModelForCausalLM.from_pretrained(save_dir)
# OpenVINO (from_transformers=True exports the model to the OpenVINO IR):
ov_model = OVModelForCausalLM.from_pretrained(save_dir, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(save_dir)
cls_pipe = pipeline("text-generation", model=ov_model, tokenizer=tokenizer)
elapsed = time.time() - t
print(elapsed)


text = "He's a dreadful magician."
for i in range(3):
  t = time.time()
  outputs = cls_pipe(text)
  print(outputs)
  elapsed = time.time() - t
  print(elapsed)

However, I saw that the OpenVINO pipeline takes about 2.5 s per generation, whereas the plain CPU (PyTorch) model takes about 1 s. I am running on an Intel CPU.

Hi @ialuronico,

Causal language models are not yet supported for quantization with OpenVINO NNCF, but we are currently working on this integration.

In order to measure the gain in latency, you can compare the OpenVINO model with the PyTorch model using:

import time

tokens = tokenizer("He's a dreadful magician.", return_tensors="pt")

def elapsed_time(model, nb_pass=40):
    # Warmup passes so that one-time costs do not skew the measurement
    for _ in range(10):
        _ = model(**tokens)
    # Average the latency over nb_pass forward passes
    start = time.time()
    for _ in range(nb_pass):
        _ = model(**tokens)
    end = time.time()
    return (end - start) / nb_pass
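For example, assuming the tokenizer and the elapsed_time function above are defined, a comparison could look like this (the model loading mirrors your first snippet):

from transformers import AutoModelForCausalLM
from optimum.intel.openvino import OVModelForCausalLM

model_id = "gpt2"
# Plain PyTorch model running on CPU
pt_model = AutoModelForCausalLM.from_pretrained(model_id)
# Same model exported to the OpenVINO IR
ov_model = OVModelForCausalLM.from_pretrained(model_id, from_transformers=True)

print(f"PyTorch latency: {elapsed_time(pt_model):.4f} s/pass")
print(f"OpenVINO latency: {elapsed_time(ov_model):.4f} s/pass")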

Hi @echarlaix, fantastic, thanks! So the version that accelerates at inference time is OpenVINO but does not use NNCF? What is the difference between the two approaches?

Actually, using your script shows that it does get sped up!

Though, using only the from_pretrained function does not really optimize the model, does it?

When using the from_pretrained method, graph optimizations will be applied to your model. NNCF enables more advanced optimizations such as quantization; currently, both quantization-aware training and post-training static quantization are supported, and you can find additional information and examples in our documentation. We also plan to integrate additional compression techniques, such as pruning and knowledge distillation, in the near future.

Also, the NNCF quantization of causal language models is now enabled (#176).
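To give an idea, here is a minimal post-training static quantization sketch (the dataset choice, sample count, and preprocessing below are illustrative assumptions, not the only possible configuration):

from functools import partial
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.intel.openvino import OVQuantizer

model_id = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# gpt2 has no padding token by default
tokenizer.pad_token = tokenizer.eos_token

def preprocess_fn(examples, tokenizer):
    return tokenizer(examples["text"], padding=True, truncation=True, max_length=64)

quantizer = OVQuantizer.from_pretrained(model)
# Calibration data used to compute the quantization parameters
calibration_dataset = quantizer.get_calibration_dataset(
    "wikitext",
    dataset_config_name="wikitext-2-raw-v1",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=50,
    dataset_split="train",
)
# Apply static quantization and save the resulting model
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory="gpt2_quantized")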
