Model quantization

I am trying to do static quantization on the T5 model (flexudy/t5-small-wav2vec2-grammar-fixer) to reduce inference time.

Code:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "flexudy/t5-small-wav2vec2-grammar-fixer"
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = T5ForConditionalGeneration.from_pretrained(model_name).to(torch_device)
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
model_fused = torch.quantization.fuse_modules(model, [['linear', 'linear']])

But it fails with:
AttributeError: 'T5ForConditionalGeneration' object has no attribute 'linear'
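
For reference, `fuse_modules` expects the fully qualified names of existing submodules, and `T5ForConditionalGeneration` has no submodule named plain `linear`, which is why this raises. A minimal sketch to list the actual Linear layer names:

import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(
    "flexudy/t5-small-wav2vec2-grammar-fixer")

# Print the qualified names of the Linear submodules; these are the
# names fuse_modules would need, and none of them is just 'linear'.
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name)  # e.g. encoder.block.0.layer.0.SelfAttention.q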

Use dynamic quantization.

Any example of T5 dynamic quantization?

https://snappishproductions.com/blog/2020/05/03/big-models-hate-this-one-weird-trick-quantization-t5--pytorch-1.4.html.html
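
The gist of that post is plain dynamic quantization (a minimal sketch, applied here to the model from the question):

import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(
    "flexudy/t5-small-wav2vec2-grammar-fixer")
model.eval()

# Replace every Linear layer with a dynamically quantized int8 version.
# Note: dynamic quantization runs on CPU only.
quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8
)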

@ndvb take a look at this method.

I don't know how sequence models work with quantization, but I think this link may help you: GitHub - Ki6an/fastT5: ⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x.
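
For example, based on the fastT5 README (a sketch, not tested here; `export_and_get_onnx_model` exports the T5 encoder/decoder to ONNX and applies int8 quantization):

from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 't5-small'
# Export to ONNX and quantize; returns a generate()-compatible wrapper.
model = export_and_get_onnx_model(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokens = tokenizer("translate English to French: thank you",
                   return_tensors='pt')
output = model.generate(input_ids=tokens['input_ids'],
                        attention_mask=tokens['attention_mask'],
                        num_beams=2)
print(tokenizer.decode(output.squeeze(), skip_special_tokens=True))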

@Pradeep1995 Thanks Pradeep,

  1. The line:
    quantized_model.save_pretrained("t5")
    gives an error:
    AttributeError: 'torch.dtype' object has no attribute 'numel'
    I believe the issue is that you can't save_pretrained a quantized model, so what the article does isn't possible.

  2. I did manage to solve it another way, but surprisingly, the quantized model takes more time to run. Please try running this code that I wrote:


import time

import torch
import torch.quantization
from transformers import T5ForConditionalGeneration, AutoConfig, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')

# Dynamically quantized models only run on CPU, and base_model is never
# moved to CUDA either, so keep everything on CPU.
device = "cpu"

base_model = T5ForConditionalGeneration.from_pretrained("t5-small")
param_count = sum(p.numel() for p in base_model.parameters())
memory = (param_count * 4) / (1024 * 1024)  # fp32 weights: 4 bytes per parameter
print(f'memory in MB: {memory}')

base_model.save_pretrained("tmp-t5-small")

quantized_model = torch.quantization.quantize_dynamic(model=base_model,
                                                      qconfig_spec={torch.nn.Linear},
                                                      dtype=torch.qint8)

# This does NOT work:
#quantized_model.save_pretrained("tmp-t5-small-quantized")

quantized_model.config.save_pretrained("tmp-t5-small-quantized-config")  # save config
quantized_state_dict = quantized_model.state_dict()
torch.save(quantized_state_dict, "tmp-t5-small-quantized-state-dict.pt")

print('Load quantized model')
quantized_config = AutoConfig.from_pretrained("tmp-t5-small-quantized-config")
dummy_model = T5ForConditionalGeneration(quantized_config)

reconstructed_quantized_model = torch.quantization.quantize_dynamic(
    dummy_model, {torch.nn.Linear}, dtype=torch.qint8
)
reconstructed_quantized_model.load_state_dict(torch.load("tmp-t5-small-quantized-state-dict.pt"))

def evaluate(model, tokenizer, sentence):
    # Time a single generate() call on the given model.
    s = time.time()
    model.eval()
    test_ids = tokenizer(sentence, return_tensors="pt").to(device).input_ids
    beam_output = model.generate(test_ids)
    print(f"eval sentence: [{tokenizer.decode(beam_output[0], skip_special_tokens=True)}], took {time.time() - s}")

prompt = "summarize: From the very beginning, Regan was seen as having series potential. After the television film scored highly in the ratings, work began on the development of the series proper. Ian Kennedy Martin's idea was for the series to be mainly studio-based, with more dialogue and less action, but producer Ted Childs disagreed, and in consequence Ian Kennedy Martin parted company with the project. Childs produced it on 16mm film, a format that allowed for a much smaller film unit than videotape at that time. This made it possible to shoot almost entirely on location which helped give the series a startling degree of realism and to use film editing techniques which enabled him to give the show a heavy bias toward action sequences. The television play and the subsequent series were commissioned by Thames Television and produced by its film division Euston Films. It was originally broadcast on ITV between 2 January 1975 and 28 December 1978 at 21:00–22:00 on weekdays (usually Mondays), with repeated screenings at the same time until the early 1980s. The writers were given strict guidelines to follow: \"Each show will have an overall screen time (minus titles) of 48 minutes 40 seconds. Each film will open with a teaser of up to 3 minutes, which will be followed by the opening titles. The story will be played across three acts, each being no more than 19 minutes and no less than 8 minutes in length. Regan will appear in every episode, Carter in approximately 10 out of 13 episodes. In addition to these main characters, scripts should be based around three major speaking parts, with up to ten minor speaking parts."
print('Quantized model generate()')
evaluate(reconstructed_quantized_model, tokenizer, prompt)
evaluate(reconstructed_quantized_model, tokenizer, prompt)
print('Base model generate()')
evaluate(base_model, tokenizer, prompt)
evaluate(base_model, tokenizer, prompt)