In the blog post on speculative decoding, the main idea is that we can use a smaller draft model to generate tokens, and then "verify" all of those generated tokens in one go using the larger model.
However, I don't quite follow how we can verify multiple tokens in a single forward pass of the larger model.
For example, assuming tokens fall at word boundaries, suppose the smaller model generates the following text in 5 forward passes:
the quick brown sock jumps
Since we do not know which token/word is incorrect (here, the 4th), won't we need to check each token one by one using the larger model? That would require at least 4 verifications:
- the → quick [pass]
- the quick → brown [pass]
- the quick brown → sock [fail]
- the quick brown fox → jumps [pass]
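To make my mental model concrete, here is a sketch of the sequential check I have in mind. The `large_model_next_token` function is just a toy stand-in (a hard-coded lookup, assuming greedy decoding), not any real model API:

```python
def large_model_next_token(prefix):
    # Toy stand-in for the large model: returns its greedy (top-1)
    # continuation for each prefix. A real model would run a forward pass.
    oracle = {
        (): "the",
        ("the",): "quick",
        ("the", "quick"): "brown",
        ("the", "quick", "brown"): "fox",
        ("the", "quick", "brown", "fox"): "jumps",
    }
    return oracle[tuple(prefix)]

def verify_sequentially(draft_tokens):
    """Check the draft tokens one by one -- one large-model call per token."""
    accepted = []
    for token in draft_tokens:
        if large_model_next_token(accepted) == token:
            accepted.append(token)  # draft token matches: keep it
        else:
            # Mismatch: substitute the large model's own token and stop.
            accepted.append(large_model_next_token(accepted))
            break
    return accepted

print(verify_sequentially(["the", "quick", "brown", "sock", "jumps"]))
# ['the', 'quick', 'brown', 'fox']
```

As written, this needs one large-model call per draft token (4 calls before "sock" is rejected), which is exactly the cost I don't see how to avoid in a single forward pass.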
The blog (and the underlying paper) seems to claim that all these verifications can be done in a single forward pass of the larger model. How is that feasible?