Generate: using k-v cache is faster but no difference to memory usage

Good analysis, but in general you need to monitor the peak CUDA memory allocation (for example, `torch.cuda.max_memory_allocated()` in PyTorch) to find the memory choke point within an inference call. That peak is what tells you the real VRAM usage.
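Here is a minimal sketch of how that peak could be measured, assuming PyTorch. The helper name `measure_peak_cuda_memory` is made up for illustration; the `torch.cuda` calls are the standard memory-statistics functions. It resets the peak counter, runs the call, and reads the high-water mark afterwards. Note that this counts memory held by PyTorch tensors, so it can be lower than what `nvidia-smi` shows.

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None


def measure_peak_cuda_memory(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, peak_mib).

    peak_mib is the peak memory occupied by PyTorch tensors on the current
    CUDA device during the call, in MiB, or None if CUDA is unavailable.
    Hypothetical helper, not part of any library.
    """
    if torch is None or not torch.cuda.is_available():
        return fn(*args, **kwargs), None

    torch.cuda.synchronize()              # finish pending work first
    torch.cuda.reset_peak_memory_stats()  # start a fresh high-water mark
    result = fn(*args, **kwargs)
    torch.cuda.synchronize()              # make sure the call has completed
    peak_mib = torch.cuda.max_memory_allocated() / 2**20
    return result, peak_mib


# Usage idea (model and inputs are placeholders, not defined here):
# _, with_cache = measure_peak_cuda_memory(model.generate, **inputs, use_cache=True)
# _, no_cache = measure_peak_cuda_memory(model.generate, **inputs, use_cache=False)
```

Comparing the two peaks, rather than the memory in use after the call returns, is what shows whether the cache changes the choke point.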
