How to reduce inference time in production with T5?

I am trying to reduce inference time in production. I am using TensorFlow on Amazon SageMaker, and I am unable to figure out how to bring the time down.
Currently, I am facing two issues:

1. The memory distribution across GPUs is uneven, and I cannot figure out why.
2. Generation (summarization) takes too long.

I am using the code below:

    from transformers import T5Tokenizer, TFT5ForConditionalGeneration
    import time
    # initialize the model architecture and weights
    model = TFT5ForConditionalGeneration.from_pretrained("t5-large")
    # initialize the model tokenizer
    tokenizer = T5Tokenizer.from_pretrained("t5-large")

    import tensorflow as tf
    start_time = time.time()

    # Optional: multi-GPU distribution (uncomment a strategy and the `with` block,
    # and re-indent the code below it)
    #strategy = tf.distribute.MultiWorkerMirroredStrategy()
    #strategy = tf.distribute.MirroredStrategy()
    #with strategy.scope():

    # `text` should already hold the document to summarize
    inputs = tokenizer("summarize: " + text, return_tensors="tf").input_ids

    outputs = model.generate(
        inputs,
        max_length=150,
        min_length=41,
        length_penalty=5,
        num_beams=2,
        no_repeat_ngram_size=2,
        early_stopping=True)

    print(tokenizer.decode(outputs[0]))
    elapsed_time = time.time() - start_time
    print(elapsed_time)
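To break the elapsed time down, the tokenization and generation steps can be timed separately, e.g. with a small helper like this (a minimal sketch; `timed` is my own name, not a library function):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    # perf_counter is a monotonic clock, better suited to interval
    # measurement than time.time()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example with a cheap stand-in function; in the script above you would
# wrap tokenizer(...) and model.generate(...) the same way:
value, seconds = timed(sum, range(1000))
print(value)
```

In the posted `generate` call, the most direct latency levers are `num_beams` (beam search roughly multiplies decoding cost by the beam count, so `num_beams=1` gives greedy decoding) and `max_length` (generation time grows with the number of decoded tokens).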

Hi, thanks for posting on the forum! What do you mean by "time at production" — training time or inference time? If you run on the SageMaker Training API, you can use the SageMaker Debugger Profiler to diagnose bottlenecks.
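For example, profiling can be enabled when constructing the estimator. This is a sketch assuming the SageMaker Python SDK's Debugger module; `train.py`, the IAM role, and the instance/framework versions are placeholders you would replace with your own:

```python
# Sketch: enabling SageMaker Debugger profiling on a training job.
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.tensorflow import TensorFlow

profiler_config = ProfilerConfig(
    # sample system metrics (CPU/GPU utilization, memory) every 500 ms
    system_monitor_interval_millis=500,
    # collect detailed framework-level traces for steps 5-14
    framework_profile_params=FrameworkProfile(start_step=5, num_steps=10),
)

estimator = TensorFlow(
    entry_point="train.py",           # placeholder training script
    role="<your-IAM-role>",           # placeholder IAM role ARN
    instance_count=1,
    instance_type="ml.p3.2xlarge",    # placeholder instance type
    framework_version="2.8",
    py_version="py39",
    profiler_config=profiler_config,
)
# estimator.fit(...)  # profiler reports are written alongside the job output
```

The resulting reports show per-step GPU utilization and memory, which should also help with the uneven GPU memory question.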