Speculative Decoding with Qwen Models

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

checkpoint_target_model = "Qwen/Qwen2.5-14B-Instruct"
checkpoint_draft_model = "Qwen/Qwen2.5-0.5B-Instruct"

# Target (main) model and its tokenizer, plus the smaller draft model
target_tokenizer = AutoTokenizer.from_pretrained(checkpoint_target_model)
target_model = AutoModelForCausalLM.from_pretrained(checkpoint_target_model, torch_dtype=torch.float16)
draft_model = AutoModelForCausalLM.from_pretrained(checkpoint_draft_model, torch_dtype=torch.float16)

prompt = "Give me detailed info about cars"
inputs = target_tokenizer(prompt, return_tensors="pt")

# Assisted (speculative) decoding: the draft model proposes tokens, the target model verifies them
output = target_model.generate(**inputs, assistant_model=draft_model, tokenizer=target_tokenizer, max_new_tokens=200, pad_token_id=target_tokenizer.eos_token_id)
print(target_tokenizer.batch_decode(output, skip_special_tokens=True))

This is the code I used, and this is the error message I got:

ValueError: The main and assistant moedels have different tokenizers. Please provide `tokenizer` and `assistant_tokenizer` to `generate()` (see https://huggingface.co/docs/transformers/en/generation_strategies#universal-assisted-decoding).

I actually thought that, since the two models come from the same model family, they would use the same tokenizer in any case. I have now found out that the Qwen models have different vocabulary sizes.
Is this correct, and does it mean I can only use Universal Assisted Generation with these models, where I have to pass the draft tokenizer as well?
Or is there a way to use pure speculative decoding? Probably not, right?
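For reference, if I understand the linked docs correctly, the Universal Assisted Generation call would look roughly like this (untested sketch on my side; `assistant_tokenizer` is the argument the error message asks for):

draft_tokenizer = AutoTokenizer.from_pretrained(checkpoint_draft_model)

output = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    tokenizer=target_tokenizer,
    assistant_tokenizer=draft_tokenizer,  # draft tokenizer, so generate() can re-encode between the two vocabularies
    max_new_tokens=200,
    pad_token_id=target_tokenizer.eos_token_id,
)
print(target_tokenizer.batch_decode(output, skip_special_tokens=True))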
Thanks for the help


I wonder if there’s something wrong with 14B.

# https://huggingface.co/blog/dynamic_speculation_lookahead

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

prompt = "Alice and Bob"
#checkpoint = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
#assistant_checkpoint = "HuggingFaceTB/SmolLM2-135M-Instruct"
#checkpoint = "Qwen/Qwen2.5-14B-Instruct"
#checkpoint = "Qwen/Qwen2.5-3B-Instruct"
checkpoint = "Qwen/Qwen2.5-1.5B-Instruct"
assistant_checkpoint = "Qwen/Qwen2.5-0.5B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device)

outputs = model.generate(**inputs, assistant_model=assistant_model, tokenizer=tokenizer, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
["Alice and Bob are playing a game with two dice. Each die is fair, but they decide to roll them simultaneously and take the sum of their outcomes. They play this game 10 times. What is the probability that the total sum of all their rolls is exactly 25? To determine the probability that the total sum of all their rolls is exactly 25 after 10 games, we need to analyze the possible outcomes of each individual roll and then consider the combined outcomes over multiple trials.\n\nFirst, let's find out how many different sums can be obtained from rolling two fair six-sided dice. The smallest sum is \\(2\\) (when both dice show 1) and the largest sum is \\(12\\) (when both dice show 6). Therefore, the possible sums range from 2 to 12.\n\nNext, we calculate the number of ways to get each specific sum:\n- Sum = 2: (1,1) — 1 way\n- Sum = "]