Is There a Way to Improve Memory Usage When Using Identical `past_key_values` for All Samples in a Batch?

I think you can use broadcasting: `torch.Tensor.expand` returns a view that repeats a tensor along its size-1 dimensions, so the cache is effectively duplicated across the batch without increasing its memory footprint:

def duplicate_pkv(pkv, num_repeats):
    # expand() returns a view over the size-1 batch dimension, so no extra memory is allocated
    return tuple(tuple(tensor.expand(num_repeats, -1, -1, -1) for tensor in layer) for layer in pkv)

where -1 tells `expand` to leave those dimensions unchanged; only the batch dimension is broadcast, and `expand` requires that dimension to be of size 1 in the cached `past_key_values`.
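
For illustration, here is a minimal sketch of what the duplication looks like end to end. The cache shapes (2 layers, 12 heads, 10 cached tokens, head dim 64) are made up, and it assumes the legacy tuple-of-tuples `past_key_values` layout of `(batch, num_heads, seq_len, head_dim)` tensors rather than the newer `Cache` classes:

import torch

# Hypothetical single-sample cache: 2 layers, each a (key, value) pair
# shaped (1, num_heads, seq_len, head_dim).
pkv = tuple(
    (torch.randn(1, 12, 10, 64), torch.randn(1, 12, 10, 64))
    for _ in range(2)
)

batched_pkv = duplicate_pkv(pkv, num_repeats=8)

print(batched_pkv[0][0].shape)                               # torch.Size([8, 12, 10, 64])
# The expanded tensors are views over the original storage:
print(batched_pkv[0][0].data_ptr() == pkv[0][0].data_ptr())  # True

One caveat: in-place writes to an expanded view can give incorrect results, since all batch entries share the same memory. This approach works when the model extends the cache by concatenation (which allocates new tensors) rather than writing into it in place; if you need independently modifiable per-sample copies, use `repeat` instead, at the cost of actually materializing the memory.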