I’ve tried a couple of configurations of WeightPruningConfig to see whether the model size shrinks or inference time is reduced. For everything I’ve tried, the exported model size has stayed the same. I’ve only tested on gpt2 so far, as larger models throw out-of-memory exceptions.
Q1. Is the model size expected to change with varying pruning types and sparsities? I started thinking maybe it won’t shrink because pruning replaces a weight with a 0 (right?) rather than removing the connection. Is that “0” still represented as a float32? Maybe that would explain why the file isn’t any smaller.
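If that’s the case, I think even a quick sanity check in plain PyTorch (nothing to do with Optimum, just my own reasoning) shows why the file wouldn’t shrink, since the zeros still take 4 bytes each:

import torch

w = torch.randn(768, 768)                          # a dense float32 weight
pruned = w.clone()
pruned[torch.rand_like(w) < 0.9] = 0.0             # zero out roughly 90% of the values

# both tensors occupy the same number of bytes, because the zeros
# are still stored as float32 values
print(w.element_size() * w.nelement())             # 2359296
print(pruned.element_size() * pruned.nelement())   # 2359296
print((pruned == 0).float().mean())                # ~0.9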
Q2. I tried changing the sparsity from 0.9 to 0.1, and both the size and the inference time stayed the same. The full code is below; I ran the unpruned gpt2 and compared it against the pruned models. The pruned models produced a 100-token inference in a shorter time (roughly half), but there was very little difference between 0.9 and 0.1 sparsity. Does this mean it maybe didn’t prune much because it hit the target easily? I didn’t see any output reporting the current sparsity while it was pruning. I also tried changing the type to channelx1 (not really knowing what that type means) and didn’t see much change.
Q3. I see the default pruning_op_types is “Conv” and “Linear”. How can I check my model to make sure those operations are in it (I assume they are, but I’m new to neural networks)? Also, how can I confirm they have actually been pruned? How can I see what the op type of a given layer is, and maybe the % of its weights that have been set to 0?
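For reference, would walking the modules myself after training be a reasonable way to check? Something like this (just a guess on my part, not sure it’s the intended way):

import torch

# after trainer.train(), inspect the in-memory model
for name, module in model.named_modules():
    weight = getattr(module, "weight", None)
    if isinstance(weight, torch.nn.Parameter):
        zero_frac = (weight == 0).float().mean().item()
        print(f"{name}: op type {type(module).__name__}, {zero_frac:.1%} zeros")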
Q4. If combining pruning with quantization, which operation should be done first? I assume pruning, since then you can use quantization-aware training?
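In case it matters, my plan was to pass both configs to the same INCTrainer, roughly like this (guessed from the docs and reusing the objects from the full code below, so I haven’t verified it works):

from neural_compressor import QuantizationAwareTrainingConfig, WeightPruningConfig
from optimum.intel import INCTrainer

pruning_config = WeightPruningConfig(target_sparsity=0.9)
quantization_config = QuantizationAwareTrainingConfig()

trainer = INCTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["train"],
    tokenizer=tokenizer,
    data_collator=DefaultDataCollator(),
    pruning_config=pruning_config,
    quantization_config=quantization_config,
)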
Q5. What is the difference between saving the model with:
trainer.save_model(outpath + "1")
model.save_pretrained(outpath + "2")
I was unable to load the model in the second case using INCModelForCausalLM.
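Concretely, this is what I tried (the second load is the one that fails):

from optimum.intel import INCModelForCausalLM

# directory written by trainer.save_model -- loads fine
INCModelForCausalLM.from_pretrained("outpath901")

# directory written by model.save_pretrained -- this is the one that fails for me
INCModelForCausalLM.from_pretrained("outpath902")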
Thanks!
Full code:
from neural_compressor import WeightPruningConfig
from optimum.intel import INCTrainer
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DefaultDataCollator, TrainingArguments)

async def prune_for_forum():
    await run(0.9, "outpath90")
    await run(0.1, "outpath10")

async def run(sparsity, outpath):
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    # dataset_manager is my own helper module
    dataset = await dataset_manager.get_processed_home_depot_dataset_for_training(tokenizer, 1024)

    pruning_config = WeightPruningConfig(target_sparsity=sparsity)
    training_args = TrainingArguments(outpath + "0", use_cpu=False, num_train_epochs=0.5,
                                      do_train=False, do_eval=False)
    trainer = INCTrainer(
        model=model,
        pruning_config=pruning_config,
        args=training_args,
        train_dataset=dataset["train"],
        # eval_dataset=dataset["validation"],
        eval_dataset=dataset["train"],
        tokenizer=tokenizer,
        data_collator=DefaultDataCollator(),
    )

    train_result = trainer.train()
    print(train_result)

    res = trainer.save_model(outpath + "1")
    print(res)
    model.save_pretrained(outpath + "2")
Running them:
from optimum.intel import INCModelForCausalLM
from transformers import AutoTokenizer, pipeline

model = INCModelForCausalLM.from_pretrained("outpath901")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device="cpu")
# perf is my own timing helper
res = await perf.print_duration("Running pipeline",
                                pipe,
                                "Here is a recipe for vegan banana bread:\n",
                                max_new_tokens=100,
                                min_new_tokens=100,
                                do_sample=True,
                                use_cache=False)