Hi everyone,

Is there a way to run the .generate() function of a fine-tuned Flan-T5 model with TensorFlow Serving? I'd like to use .generate() with beam search. I've managed to serve the model with the basic .call() function, but the response is roughly 55 MB in size when using the REST API, because it returns all the logits and the encoder_last_hidden_state.
My model signature looks as follows:
```
The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_attention_mask:0
  inputs['decoder_attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_decoder_attention_mask:0
  inputs['decoder_input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_decoder_input_ids:0
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_input_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['encoder_last_hidden_state'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, -1, 512)
      name: StatefulPartitionedCall:0
  outputs['logits'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, -1, 32128)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
```
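For context, that response size is consistent with the logits output in the signature above: a float32 tensor over the 32,128-token vocabulary grows very quickly with the decoder length. A rough, purely illustrative estimate (the sequence lengths below are assumptions, not measured values):

```
# Back-of-the-envelope size of the .call() outputs above (float32 = 4 bytes per value).
# The sequence lengths are illustrative assumptions, not measured values.
vocab_size = 32128   # matches the logits shape (-1, -1, 32128)
hidden_size = 512    # matches the encoder_last_hidden_state shape (-1, -1, 512)
dec_len, enc_len = 430, 128

logits_mb = 1 * dec_len * vocab_size * 4 / 1e6     # ~55 MB for a single input
encoder_mb = 1 * enc_len * hidden_size * 4 / 1e6   # ~0.26 MB
print(f"logits ≈ {logits_mb:.1f} MB, encoder_last_hidden_state ≈ {encoder_mb:.2f} MB")
```

Returning only the generated token ids instead of the raw logits avoids this entirely, which is why I want to export .generate() directly.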
After some more digging, I found this PR:
huggingface:main ← nlpcat:fix.generate.batch (opened 01:49 AM, 30 Jul 22 UTC)
# What does this PR do?
Support dynamic input for tf.function + generate (XLA…); needed for batched TF Serving.

export:
```
import tensorflow as tf
from transformers import TFAutoModelForSeq2SeqLM


class MyOwnModel(tf.Module):
    def __init__(self, model_path="t5-small"):
        super(MyOwnModel, self).__init__()
        self.model = TFAutoModelForSeq2SeqLM.from_pretrained(model_path)

    @tf.function(
        input_signature=(
            tf.TensorSpec((None, 32), tf.int32, name="input_ids"),
            tf.TensorSpec((None, 32), tf.int32, name="attention_mask"),
        ),
        jit_compile=True,
    )
    def serving(self, input_ids, attention_mask):
        outputs = self.model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=32,
            return_dict_in_generate=True,
        )
        return {"sequences": outputs["sequences"]}


model = MyOwnModel()
export_dir = "./"
tf.saved_model.save(
    model,
    export_dir,
    signatures={"serving_default": model.serving},
)
```
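As a quick sanity check (not part of the PR itself), the exported signature can be inspected after loading the SavedModel back; the "./" export directory matches the snippet above:

```
import tensorflow as tf

# Inspect the exported serving signature (assumes the SavedModel was written to "./").
loaded = tf.saved_model.load("./")
serving_fn = loaded.signatures["serving_default"]
print(serving_fn.structured_input_signature)  # expect input_ids / attention_mask with shape (None, 32)
print(serving_fn.structured_outputs)          # expect a single "sequences" output
```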
tf model run
```
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

export_dir = "./"
model = tf.saved_model.load(export_dir)
tokenizer = AutoTokenizer.from_pretrained("t5-small")

tokenization_kwargs = {"pad_to_multiple_of": 32, "padding": True, "return_tensors": "tf"}
input_prompts = [
    f"translate English to {language}: I have four cats and three dogs."
    for language in ["German", "French", "Romanian"]
]


def generate_text(inputs):
    tokenized_inputs = tokenizer(inputs, **tokenization_kwargs)
    generated_texts = model.signatures["serving_default"](**tokenized_inputs)
    for text in generated_texts["sequences"]:
        print(tokenizer.decode(text, skip_special_tokens=True))


# The first prompt will be slow (compiling), the others will be very fast!
generate_text(input_prompts[:2])
generate_text(input_prompts[:3])
```
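For completeness (again, not from the PR), here is roughly how such an export could be called through TF Serving's REST API once the SavedModel is mounted in the tensorflow/serving container. The model name flan_t5, port 8501, and padding to length 32 (to match the (None, 32) signature) are all assumptions:

```
import requests
from transformers import AutoTokenizer

# Assumptions: TF Serving listens on localhost:8501 and the model is registered as "flan_t5";
# inputs are padded to 32 tokens to match the (None, 32) input signature exported above.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
batch = tokenizer(
    ["translate English to German: I have four cats and three dogs."],
    padding="max_length",
    max_length=32,
    return_tensors="np",
)

payload = {
    "instances": [
        {"input_ids": ids.tolist(), "attention_mask": mask.tolist()}
        for ids, mask in zip(batch["input_ids"], batch["attention_mask"])
    ]
}
response = requests.post("http://localhost:8501/v1/models/flan_t5:predict", json=payload)
# With a single named output, TF Serving returns it directly under "predictions".
sequences = response.json()["predictions"]
print(tokenizer.batch_decode(sequences, skip_special_tokens=True))
```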
xla_run
```
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = TFAutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Main changes with respect to the original generate workflow: `tf.function` and `pad_to_multiple_of`
xla_generate = tf.function(model.generate, jit_compile=True)
tokenization_kwargs = {"pad_to_multiple_of": 32, "padding": True, "return_tensors": "tf"}

# The first prompt will be slow (compiling), the others will be very fast!
input_prompts = [
    f"translate English to {language}: I have four cats and three dogs."
    for language in ["German", "French", "Romanian"]
]

tokenized_inputs = tokenizer(input_prompts, **tokenization_kwargs)
generated_texts = xla_generate(**tokenized_inputs, max_new_tokens=32)
for text in generated_texts:
    print(tokenizer.decode(text, skip_special_tokens=True))
```
This also works for beam search by changing the exported code to:
```
def serving(self, input_ids, attention_mask):
    outputs = self.model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=32,
        return_dict_in_generate=True,
        num_beams=3,
        num_return_sequences=3,
    )
    return {"sequences": outputs["sequences"]}
```
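One thing worth noting with this variant (an observation about generate, not something stated in the PR): with num_return_sequences=3 the returned "sequences" tensor stacks the beams along the batch axis, so the serving response has shape (batch_size * 3, seq_len). A small, hypothetical helper to regroup them per input on the client side:

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

def group_beams(sequences, num_return_sequences=3):
    # `sequences` is the "sequences" output of the serving call above,
    # with the beams for each input stacked along the batch axis.
    decoded = tokenizer.batch_decode(sequences, skip_special_tokens=True)
    return [
        decoded[i : i + num_return_sequences]
        for i in range(0, len(decoded), num_return_sequences)
    ]
```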
Fixes #18357
Fixes #16823
cc @gante @patrickvonplaten
My exporter code looks as follows:

```
import tensorflow as tf
import transformers


class MyModule(tf.Module):
    def __init__(self, model: transformers.TFAutoModelForSeq2SeqLM) -> None:
        super(MyModule, self).__init__()
        self.model = model

    @tf.function(
        input_signature=(
            tf.TensorSpec(name="input_ids", shape=(None, None), dtype=tf.int32),
            tf.TensorSpec(name="attention_mask", shape=(None, None), dtype=tf.int32),
        ),
        jit_compile=False,
    )
    def serving(self, input_ids, attention_mask) -> dict:
        outputs = self.model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_length=512,
            min_length=10,
            # length_penalty=0.9,
            # repetition_penalty=2.0,
            # num_beams=4,
            # early_stopping=True,
            return_dict_in_generate=True,
        )
        return {"sequences": outputs["sequences"]}
```
Using this exporter I get the following model signature:
```
The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_attention_mask:0
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, -1)
      name: serving_default_input_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['sequences'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 512)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict
```
However, when I try to run inference on the model using the default tensorflow/serving Docker image (REST API), I get the following error:

```
{
    "error": "XLA compilation disabled\n\t [[{{function_node while_body_14884}}{{node while/tft5_for_conditional_generation/decoder/block_._0/layer_._0/SelfAttention/XlaDynamicSlice}}]]"
}
```
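To narrow down whether the problem lies in the exported graph itself or in the TF Serving setup, I can run a local smoke test along these lines (the tokenizer checkpoint "google/flan-t5-small" and the export path "./" are placeholders for my actual fine-tuned model):

```
import tensorflow as tf
from transformers import AutoTokenizer

# Load the exported SavedModel and call the serving signature directly, outside TF Serving.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
loaded = tf.saved_model.load("./")
serving_fn = loaded.signatures["serving_default"]

inputs = tokenizer(
    ["translate English to German: I have four cats and three dogs."],
    padding=True,
    return_tensors="tf",
)
outputs = serving_fn(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(tokenizer.batch_decode(outputs["sequences"], skip_special_tokens=True))
```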
What is the right way to serve a fine-tuned Flan-T5 model using TensorFlow Serving?

I'm using transformers version 4.34.0 and tensorflow version 2.13.1.

Maybe @joaogante ?

Thanks in advance!