Flan-T5 with Tensorflow-Serving

Hi everyone

Is there a way to run the .generate() function of a fine-tuned Flan-T5 model with Tensorflow-Serving?

I'd like to use the .generate() function with beam search.

I've managed to make it run with the basic .call() function; however, the response is roughly 55 MB when using the REST API, as it returns all the logits and the encoder_last_hidden_state.
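
(For context, that first version was exported with the stock SavedModel export of the model's call(), roughly along these lines; paths are placeholders:)

import transformers

model = transformers.TFAutoModelForSeq2SeqLM.from_pretrained("path/to/fine-tuned-flan-t5")  # placeholder path
# saved_model=True writes a SavedModel whose default signature wraps call()
model.save_pretrained("export/flan_t5_call", saved_model=True)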

My model signature looks as follows:

The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
    dtype: DT_INT32
    shape: (-1, -1)
    name: serving_default_attention_mask:0
  inputs['decoder_attention_mask'] tensor_info:
    dtype: DT_INT32
    shape: (-1, -1)
    name: serving_default_decoder_attention_mask:0
  inputs['decoder_input_ids'] tensor_info:
    dtype: DT_INT32
    shape: (-1, -1)
    name: serving_default_decoder_input_ids:0
  inputs['input_ids'] tensor_info:
    dtype: DT_INT32
    shape: (-1, -1)
    name: serving_default_input_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['encoder_last_hidden_state'] tensor_info:
    dtype: DT_FLOAT
    shape: (-1, -1, 512)
    name: StatefulPartitionedCall:0
  outputs['logits'] tensor_info:
    dtype: DT_FLOAT
    shape: (-1, -1, 32128)
    name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
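
Those output shapes are what blow up the response: the logits tensor alone is sequence_length × 32128 float32 values per example, so a long generation accounts for tens of MB before the JSON encoding even adds overhead. Rough back-of-the-envelope check (numbers are illustrative):

# Rough size of the logits output alone for one 512-token generation (before JSON overhead)
seq_len, vocab_size, bytes_per_float32 = 512, 32128, 4
print(f"{seq_len * vocab_size * bytes_per_float32 / 1e6:.1f} MB")  # ~65.8 MB of raw float32 data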

After some more digging, I found this PR:

My Exporter Code looks as follows:

import tensorflow as tf
import transformers


class MyModule(tf.Module):
  def __init__(self, model: transformers.TFAutoModelForSeq2SeqLM) -> None:
    super(MyModule, self).__init__()
    self.model = model

  @tf.function(
    input_signature=(
      tf.TensorSpec(name="input_ids", shape=(None, None), dtype=tf.int32),
      tf.TensorSpec(name="attention_mask", shape=(None, None), dtype=tf.int32),
    ),
    jit_compile=False,
  )
  def serving(self, input_ids, attention_mask) -> dict:
    outputs = self.model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_length=512,
      min_length=10,
      # length_penalty=0.9,
      # repetition_penalty=2.0,
      # num_beams=4,
      # early_stopping=True,
      return_dict_in_generate=True,
    )
    # Only the generated token ids are returned, to keep the serving response small
    return {"sequences": outputs["sequences"]}
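
(The save call itself isn't shown above; it's essentially tf.saved_model.save with the serving function registered as the default signature, along these lines, with placeholder paths:)

model = transformers.TFAutoModelForSeq2SeqLM.from_pretrained("path/to/fine-tuned-flan-t5")  # placeholder path
module = MyModule(model)
tf.saved_model.save(
  module,
  "export/flan_t5_generate/1",  # "1" is the version directory TensorFlow Serving expects
  signatures={"serving_default": module.serving},
)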

Using this exporter I get the following model signature:

The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
    dtype: DT_INT32
    shape: (-1, -1)
    name: serving_default_attention_mask:0
  inputs['input_ids'] tensor_info:
    dtype: DT_INT32
    shape: (-1, -1)
    name: serving_default_input_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['sequences'] tensor_info:
    dtype: DT_INT32
    shape: (-1, 512)
    name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict
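
For completeness, the request is the standard TensorFlow Serving REST predict call, roughly like this (model name, port and token ids below are placeholders):

import requests

payload = {
  "inputs": {
    "input_ids": [[644, 4445, 1]],      # placeholder token ids from the Flan-T5 tokenizer
    "attention_mask": [[1, 1, 1]],
  }
}
resp = requests.post("http://localhost:8501/v1/models/flan_t5:predict", json=payload)
print(resp.json())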

However, when I try to run inference on the model using the default tensorflow/serving docker image (REST API), I get the following error:

{
  "error": "XLA compilation disabled\n\t [[{{function node while body_14884}}{{node while/tft5_for_conditional_generation/decoder/block_._0/layer_._0/SelfAttention/XLaDynamicSLice}}]]"
}

What is the right way to serve a fine-tuned Flan-T5 model using Tensorflow-Serving?

I'm using transformers version 4.34.0 and tensorflow version 2.13.1.

Maybe @joaogante?

Thanks in advance!