The line:
quantized_model.save_pretrained("t5")
gives an error:
AttributeError: 'torch.dtype' object has no attribute 'numel'
I believe this is the issue:
I think you can't call save_pretrained on a quantized model, so you can't do what's in this article.
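My guess at the cause (an assumption, I have not traced it through the transformers source): a dynamically quantized Linear layer keeps packed weights and a torch.dtype in its state dict, and those entries are not tensors, so any code that calls .numel() on every entry would fail. A minimal sketch on a toy model (the small nn.Sequential is only a stand-in for T5, and it needs a CPU build of PyTorch with a quantized backend) that lists the non-tensor entries:

```python
import torch
import torch.nn as nn

# Toy stand-in model (hypothetical, not T5): a single Linear layer.
toy_model = nn.Sequential(nn.Linear(4, 2))

# Same call as in the post, applied to the toy model.
quantized = torch.quantization.quantize_dynamic(
    toy_model, {nn.Linear}, dtype=torch.qint8
)

# Collect state-dict entries that are not tensors.
non_tensor_entries = {
    key: type(value).__name__
    for key, value in quantized.state_dict().items()
    if not isinstance(value, torch.Tensor)
}

for key, type_name in non_tensor_entries.items():
    print(f"{key}: {type_name}")
```

If the printed list includes an entry of type dtype, that would be consistent with the AttributeError above.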
I did manage to work around it another way but, surprisingly, the quantized model takes more time to run. Please try running this code that I wrote:
import torch
from transformers import pipeline, T5ForConditionalGeneration, AutoConfig, T5Tokenizer
import torch.quantization
tokenizer = T5Tokenizer.from_pretrained('t5-small')
# Dynamically quantized Linear layers only run on CPU, and the models below
# are never moved to a GPU, so keep the inputs on CPU as well.
device = "cpu"
base_model = T5ForConditionalGeneration.from_pretrained("t5-small")
param_count = sum(p.numel() for p in base_model.parameters())
memory = (param_count * 4) / (1024 * 1024)
print(f'memory in MB: {memory}')
base_model.save_pretrained("tmp-t5-small")
quantized_model = torch.quantization.quantize_dynamic(
    model=base_model,
    qconfig_spec={torch.nn.Linear},
    dtype=torch.qint8,
)
# This does NOT work:
#quantized_model.save_pretrained("tmp-t5-small-quantized")
quantized_model.config.save_pretrained("tmp-t5-small-quantized-config") # save config
quantized_state_dict = quantized_model.state_dict()
torch.save(quantized_state_dict, "tmp-t5-small-quantized-state-dict.pt")
print('Load quantized model')
quantized_config = AutoConfig.from_pretrained("tmp-t5-small-quantized-config")
dummy_model = T5ForConditionalGeneration(quantized_config)
reconstructed_quantized_model = torch.quantization.quantize_dynamic(
    dummy_model, {torch.nn.Linear}, dtype=torch.qint8
)
reconstructed_quantized_model.load_state_dict(torch.load("tmp-t5-small-quantized-state-dict.pt"))
def eval(model, tokenizer, sentence):
    # Note: this shadows Python's built-in eval(); the name is kept so the calls below still work.
    import time
    s = time.time()
    model.eval()  # switch to inference mode (disables dropout)
    test_ids = tokenizer(sentence, return_tensors="pt").to(device).input_ids
    beam_output = model.generate(test_ids)
    print(f"eval sentence: [{tokenizer.decode(beam_output[0], skip_special_tokens=True)}], took {time.time() - s}")
prompt = "summarize: From the very beginning, Regan was seen as having series potential. After the television film scored highly in the ratings, work began on the development of the series proper. Ian Kennedy Martin's idea was for the series to be mainly studio-based, with more dialogue and less action, but producer Ted Childs disagreed, and in consequence Ian Kennedy Martin parted company with the project. Childs produced it on 16mm film, a format that allowed for a much smaller film unit than videotape at that time. This made it possible to shoot almost entirely on location which helped give the series a startling degree of realism and to use film editing techniques which enabled him to give the show a heavy bias toward action sequences. The television play and the subsequent series were commissioned by Thames Television and produced by its film division Euston Films. It was originally broadcast on ITV between 2 January 1975 and 28 December 1978 at 21:00–22:00 on weekdays (usually Mondays), with repeated screenings at the same time until the early 1980s. The writers were given strict guidelines to follow: \"Each show will have an overall screen time (minus titles) of 48 minutes 40 seconds. Each film will open with a teaser of up to 3 minutes, which will be followed by the opening titles. The story will be played across three acts, each being no more than 19 minutes and no less than 8 minutes in length. Regan will appear in every episode, Carter in approximately 10 out of 13 episodes. In addition to these main characters, scripts should be based around three major speaking parts, with up to ten minor speaking parts."
print('Quantized model generate()')
eval(reconstructed_quantized_model, tokenizer, prompt)
eval(reconstructed_quantized_model, tokenizer, prompt)
print('Base model generate()')
eval(base_model, tokenizer, prompt)
eval(base_model, tokenizer, prompt)
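One caveat about my own measurement: a single time.time() reading per call is noisy, and the timing above also includes tokenization. Below is a minimal sketch of a steadier timing helper with warm-up runs and a median over several repeats. The name benchmark and its parameters are my own hypothetical choices, not part of torch or transformers, and the workload here is a stand-in so the snippet runs by itself.

```python
import statistics
import time


def benchmark(fn, warmup=1, repeats=5):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload (hypothetical); with the script above it could be
# something like: lambda: base_model.generate(test_ids)
median_seconds = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"median: {median_seconds:.6f} s")
```

Running both models through the same helper on the same pre-tokenized input should make the comparison less sensitive to one-off effects in the first call.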