AutoModelForCausalLM and OpenVINO

Following the Optimization documentation, I would like to quantize an AutoModelForCausalLM such as gpt2 with OpenVINO.

First I got an error saying that the text-generation task is not supported. I also tried quantizer = OVQuantizer.from_pretrained(model, feature='causal-lm'), but I get other errors.
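For reference, here is roughly what I ran (a minimal sketch; the gpt2 loading line is my assumption, since I did not paste my exact script):

from transformers import AutoModelForCausalLM
from optimum.intel.openvino import OVQuantizer

# Assumption: loading gpt2 as a causal LM, as in my snippet below
model = AutoModelForCausalLM.from_pretrained("gpt2")

# This is the call that fails for me
quantizer = OVQuantizer.from_pretrained(model, feature="causal-lm")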

Are AutoModelForCausalLM models supported?

I saw that I can accelerate a model at inference time using:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from optimum.intel.openvino import OVModelForCausalLM

import time

save_dir = "gpt2"

t = time.time()
# Plain CPU (PyTorch):
# ov_model = AutoModelForCausalLM.from_pretrained(save_dir)
# OpenVINO (from_transformers=True exports the model to the OpenVINO IR):
ov_model = OVModelForCausalLM.from_pretrained(save_dir, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(save_dir)
cls_pipe = pipeline("text-generation", model=ov_model, tokenizer=tokenizer)
elapsed = time.time() - t
print(elapsed)


text = "He's a dreadful magician."
for i in range(3):
  t = time.time()
  outputs = cls_pipe(text)
  print(outputs)
  elapsed = time.time() - t
  print(elapsed)

However, I saw that the OpenVINO pipeline takes about 2.5 s per generation, whereas the plain CPU (PyTorch) model takes about 1 s. I am running on an Intel CPU.

Hi @ialuronico,

Causal language models are not yet supported for quantization with OpenVINO NNCF, but we are currently working on this integration.

In order to measure the gain in latency, you can compare the OpenVINO model with the PyTorch model using:

import time

tokens = tokenizer("He's a dreadful magician.", return_tensors="pt")

def elapsed_time(model, nb_pass=40):
    # Warmup passes so that one-time costs do not skew the measurement
    for _ in range(10):
        _ = model(**tokens)
    # Average the latency over nb_pass forward passes
    start = time.time()
    for _ in range(nb_pass):
        _ = model(**tokens)
    end = time.time()
    return (end - start) / nb_pass
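For example, assuming the tokenizer and the elapsed_time function above are defined, a comparison could look like this (the model loading mirrors your first snippet):

from transformers import AutoModelForCausalLM
from optimum.intel.openvino import OVModelForCausalLM

model_id = "gpt2"
# Plain PyTorch model running on CPU
pt_model = AutoModelForCausalLM.from_pretrained(model_id)
# Same model exported to the OpenVINO IR
ov_model = OVModelForCausalLM.from_pretrained(model_id, from_transformers=True)

print(f"PyTorch latency: {elapsed_time(pt_model):.4f} s/pass")
print(f"OpenVINO latency: {elapsed_time(ov_model):.4f} s/pass")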

Hi @echarlaix, fantastic, thanks! So the version that accelerates at inference time is OpenVINO but does not use NNCF? What is the difference between the two approaches?

Actually, using your script shows that it does get sped up!

Though, using only the from_pretrained function does not really optimize the model, does it?

When using the from_pretrained method, graph optimizations will be applied to your model. NNCF enables more advanced optimizations such as quantization; currently, both quantization-aware training and post-training static quantization are supported, and you can find additional information and examples in our documentation. We also plan to integrate additional compression techniques, such as pruning and knowledge distillation, in the near future.

Also, the NNCF quantization of causal language models is now enabled (#176).
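To give an idea, here is a minimal post-training static quantization sketch (the dataset choice, sample count, and preprocessing below are illustrative assumptions, not the only possible configuration):

from functools import partial
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.intel.openvino import OVQuantizer

model_id = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# gpt2 has no padding token by default
tokenizer.pad_token = tokenizer.eos_token

def preprocess_fn(examples, tokenizer):
    return tokenizer(examples["text"], padding=True, truncation=True, max_length=64)

quantizer = OVQuantizer.from_pretrained(model)
# Calibration data used to compute the quantization parameters
calibration_dataset = quantizer.get_calibration_dataset(
    "wikitext",
    dataset_config_name="wikitext-2-raw-v1",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=50,
    dataset_split="train",
)
# Apply static quantization and save the resulting model
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory="gpt2_quantized")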
