GPU memory usage for GPT-J inference

I am using a fine-tuned GPT-J model and I would like to deploy it for inference, feeding it a batch of input_ids at a time. However, I don't understand why memory occupancy grows so quickly as the batch size increases.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Inference only: disable autograd globally so no computation graph is kept.
torch.set_grad_enabled(False)

model_load_name = "GPTJgse"
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v1-6b")
# Load the fine-tuned checkpoint 8-bit quantized; non-quantized modules stay in fp16.
model = AutoModelForCausalLM.from_pretrained(
    f"{model_load_name}/",
    torch_dtype=torch.float16,
    device_map='auto',
    load_in_8bit=True,
    low_cpu_mem_usage=True,
)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(torch.cuda.memory_reserved())
model.eval()

# model input defined somewhere else
model_input = "some text of 1491 tokens ..."

At this point 6721372160 bytes are reserved for the model, which is what I expect for an 8-bit quantized model with about 6.7B parameters.
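For reference, this is how I sanity-check the weight footprint (a rough sketch: with load_in_8bit most linear weights are stored as int8, while modules such as the layer norms and the embedding may stay in fp16, so it will not exactly match memory_reserved):

# Sum the storage of all parameters actually held by the model.
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(param_bytes / 1e9, "GB of parameters")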
When I start inference with the following code, memory grows very quickly:

def inference(batch_size):
    torch.cuda.empty_cache()
    inputs = [model_input] * batch_size
    # Tokenize the batch and move it to the GPU explicitly.
    input_ids = tokenizer(inputs, return_tensors="pt", max_length=2048, truncation=True)['input_ids'].to(device)
    # Single forward pass over the full prompt, keeping the key/value cache.
    output = model(input_ids, return_dict=True, use_cache=True)
    print(torch.cuda.memory_reserved())

inference(1)
inference(2)
inference(3)
inference(4)

I get the following results (bytes reserved):
8621391872
10657726464
12689866752
14684258304

So each additional element in the batch occupies about 2 GB of GPU memory.
However, if I consider the GPT-J hyperparameters (d_head = 256, n_head = 16, n_layers = 28) and assume 16-bit precision, each batch element should occupy roughly:
2 (bytes) * 3 (key, query, value) * 256 (d_head) * 16 (n_head) * 28 (n_layers) * 2000 (context length as an upper bound)
which is less than 1.4 GB of memory.
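Spelling out that estimate (just the arithmetic above in code; the 2000-token context is my upper bound, not a measured value):

# Back-of-the-envelope estimate of per-element key/query/value memory in fp16.
d_head, n_head, n_layers, seq_len = 256, 16, 28, 2000
bytes_per_value = 2   # fp16
kqv = 3               # key, query, value
per_element_bytes = bytes_per_value * kqv * d_head * n_head * n_layers * seq_len
print(per_element_bytes / 1e9)  # ~1.38 GB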

My goal is to use a larger batch_size so that inference scales better.
Is there a solution? How would one go about estimating the memory requirements?
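For completeness, this is how I would measure the actual peak allocation per call, in case that is the fairer number to compare against the estimate (a sketch; torch.cuda.memory_reserved also counts the caching allocator's overhead, while max_memory_allocated tracks tensors only):

def inference_peak(batch_size):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    inputs = [model_input] * batch_size
    input_ids = tokenizer(inputs, return_tensors="pt")['input_ids'].to(device)
    # Same single forward pass as above, keeping the key/value cache.
    model(input_ids, return_dict=True, use_cache=True)
    print(batch_size, torch.cuda.max_memory_allocated())

for bs in (1, 2, 3, 4):
    inference_peak(bs)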