Speculative Decoding with Qwen Models

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

checkpoint_target_model = "Qwen/Qwen2.5-14B-Instruct"
checkpoint_draft_model = "Qwen/Qwen2.5-0.5B-Instruct"

# Target (main) model and its tokenizer, plus the smaller draft model
target_tokenizer = AutoTokenizer.from_pretrained(checkpoint_target_model)
target_model = AutoModelForCausalLM.from_pretrained(checkpoint_target_model, torch_dtype=torch.float16)
draft_model = AutoModelForCausalLM.from_pretrained(checkpoint_draft_model, torch_dtype=torch.float16)

prompt = "Give me detailed info about cars"
inputs = target_tokenizer(prompt, return_tensors="pt")

# Assisted (speculative) decoding: the draft model proposes tokens, the target model verifies them
output = target_model.generate(**inputs, assistant_model=draft_model, tokenizer=target_tokenizer, max_new_tokens=200, pad_token_id=target_tokenizer.eos_token_id)
print(target_tokenizer.batch_decode(output, skip_special_tokens=True))

This is the code I used, and this is the error message I got:

ValueError: The main and assistant moedels have different tokenizers. Please provide `tokenizer` and `assistant_tokenizer` to `generate()` (see https://huggingface.co/docs/transformers/en/generation_strategies#universal-assisted-decoding).

I actually thought that, since the two models come from the same model family, they would use the same tokenizer in any case. I have now found out that the Qwen models have different vocabulary sizes.
Is this correct, and does it mean I can only use Universal Assisted Generation with these models, where I have to pass the draft tokenizer as well?
Or is there a way to use pure speculative decoding? Probably not, right?
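For reference, if I understand the linked docs correctly, the Universal Assisted Generation call would look roughly like this (untested sketch on my side; `assistant_tokenizer` is the argument the error message asks for):

draft_tokenizer = AutoTokenizer.from_pretrained(checkpoint_draft_model)

output = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    tokenizer=target_tokenizer,
    assistant_tokenizer=draft_tokenizer,  # draft tokenizer, so generate() can re-encode between the two vocabularies
    max_new_tokens=200,
    pad_token_id=target_tokenizer.eos_token_id,
)
print(target_tokenizer.batch_decode(output, skip_special_tokens=True))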
Thanks for the help


I wonder if there’s something wrong with 14B.

# https://huggingface.co/blog/dynamic_speculation_lookahead

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

prompt = "Alice and Bob"
#checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
#assistant_checkpoint = "HuggingFaceTB/SmolLM2-135M-Instruct"
#checkpoint = "Qwen/Qwen2.5-14B-Instruct"
#checkpoint = "Qwen/Qwen2.5-3B-Instruct"
checkpoint = "Qwen/Qwen2.5-1.5B-Instruct"
assistant_checkpoint = "Qwen/Qwen2.5-0.5B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device)

outputs = model.generate(**inputs, assistant_model=assistant_model, tokenizer=tokenizer, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["Alice and Bob are playing a game with two dice. Each die is fair, but they decide to roll them simultaneously and take the sum of their outcomes. They play this game 10 times. What is the probability that the total sum of all their rolls is exactly 25? To determine the probability that the total sum of all their rolls is exactly 25 after 10 games, we need to analyze the possible outcomes of each individual roll and then consider the combined outcomes over multiple trials.\n\nFirst, let's find out how many different sums can be obtained from rolling two fair six-sided dice. The smallest sum is \\(2\\) (when both dice show 1) and the largest sum is \\(12\\) (when both dice show 6). Therefore, the possible sums range from 2 to 12.\n\nNext, we calculate the number of ways to get each specific sum:\n- Sum = 2: (1,1) — 1 way\n- Sum = "]