Flan-T5 with Tensorflow-Serving

Hi everyone

Is there a way to run the .generate() function of a fine-tuned Flan-T5 model with Tensorflow-Serving?

I'd like to use the .generate() function with beam search.

I've managed to make it run with the basic .call() function; however, the response is roughly 55 MB when using the REST API, as it returns all the logits and the encoder_last_hidden_state.
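
(For context, that first version was exported with the stock SavedModel export of the model's call(), roughly along these lines; paths are placeholders:)

import transformers

model = transformers.TFAutoModelForSeq2SeqLM.from_pretrained("path/to/fine-tuned-flan-t5")  # placeholder path
# saved_model=True writes a SavedModel whose default signature wraps call()
model.save_pretrained("export/flan_t5_call", saved_model=True)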

My model signature looks as follows:

The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
    dtype: DT_INT32
    shape: (-1, -1)
    name: serving_default_attention_mask:0
  inputs['decoder_attention_mask'] tensor_info:
    dtype: DT_INT32
    shape: (-1, -1)
    name: serving_default_decoder_attention_mask:0
  inputs['decoder_input_ids'] tensor_info:
    dtype: DT_INT32
    shape: (-1, -1)
    name: serving_default_decoder_input_ids:0
  inputs['input_ids'] tensor_info:
    dtype: DT_INT32
    shape: (-1, -1)
    name: serving_default_input_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['encoder_last_hidden_state'] tensor_info:
    dtype: DT_FLOAT
    shape: (-1, -1, 512)
    name: StatefulPartitionedCall:0
  outputs['logits'] tensor_info:
    dtype: DT_FLOAT
    shape: (-1, -1, 32128)
    name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
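
Those output shapes are what blow up the response: the logits tensor alone is sequence_length × 32128 float32 values per example, so a long generation accounts for tens of MB before the JSON encoding even adds overhead. Rough back-of-the-envelope check (numbers are illustrative):

# Rough size of the logits output alone for one 512-token generation (before JSON overhead)
seq_len, vocab_size, bytes_per_float32 = 512, 32128, 4
print(f"{seq_len * vocab_size * bytes_per_float32 / 1e6:.1f} MB")  # ~65.8 MB of raw float32 data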

After some more digging, I found this PR:

My Exporter Code looks as follows:

import tensorflow as tf
import transformers


class MyModule(tf.Module):
  def __init__(self, model: transformers.TFAutoModelForSeq2SeqLM) -> None:
    super(MyModule, self).__init__()
    self.model = model

  @tf.function(
    input_signature=(
      tf.TensorSpec(name="input_ids", shape=(None, None), dtype=tf.int32),
      tf.TensorSpec(name="attention_mask", shape=(None, None), dtype=tf.int32),
    ),
    jit_compile=False,
  )
  def serving(self, input_ids, attention_mask) -> dict:
    outputs = self.model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_length=512,
      min_length=10,
      # length_penalty=0.9,
      # repetition_penalty=2.0,
      # num_beams=4,
      # early_stopping=True,
      return_dict_in_generate=True,
    )
    # Only the generated token ids are returned, to keep the serving response small
    return {"sequences": outputs["sequences"]}
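
(The save call itself isn't shown above; it's essentially tf.saved_model.save with the serving function registered as the default signature, along these lines, with placeholder paths:)

model = transformers.TFAutoModelForSeq2SeqLM.from_pretrained("path/to/fine-tuned-flan-t5")  # placeholder path
module = MyModule(model)
tf.saved_model.save(
  module,
  "export/flan_t5_generate/1",  # "1" is the version directory TensorFlow Serving expects
  signatures={"serving_default": module.serving},
)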

Using this exporter I get the following model signature:

The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
    dtype: DT_INT32
    shape: (-1, -1)
    name: serving_default_attention_mask:0
  inputs['input_ids'] tensor_info:
    dtype: DT_INT32
    shape: (-1, -1)
    name: serving_default_input_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['sequences'] tensor_info:
    dtype: DT_INT32
    shape: (-1, 512)
    name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict
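
For completeness, the request is the standard TensorFlow Serving REST predict call, roughly like this (model name, port and token ids below are placeholders):

import requests

payload = {
  "inputs": {
    "input_ids": [[644, 4445, 1]],      # placeholder token ids from the Flan-T5 tokenizer
    "attention_mask": [[1, 1, 1]],
  }
}
resp = requests.post("http://localhost:8501/v1/models/flan_t5:predict", json=payload)
print(resp.json())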

However, when I try to run inference on the model using the default tensorflow/serving docker image (REST API), I get the following error:

{
  "error": "XLA compilation disabled\n\t [[{{function node while body_14884}}{{node while/tft5_for_conditional_generation/decoder/block_._0/layer_._0/SelfAttention/XLaDynamicSLice}}]]"
}

What is the right way to serve a fine-tuned Flan-T5 model using Tensorflow-Serving?

I'm using transformers version 4.34.0 and tensorflow version 2.13.1.

Maybe @joaogante?

Thanks in advance!