Regarding: I’d like to understand the configuration of the Start Up for Organizations.
I have a language model fine-tuned on top of the pretrained GPT-2 Small. When I deployed the language model on AWS SageMaker, Google Colab, and Hugging Face, I observed the following.
2/ First run vs. second run should not really make a difference. Are you trying two different payloads?
3/ The actual run time of a query on a text-generation pipeline can depend on when the EOS token happens to be generated (otherwise it will simply generate up to max_tokens, which seems to be set to 500 for your model). So when trying to measure inference time, you need to make sure that you are generating the same number of tokens on every run, and that EOS cannot cut generation short.
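A minimal sketch of that idea, assuming a stock `gpt2` checkpoint stands in for your fine-tuned model: setting `min_new_tokens` equal to `max_new_tokens` suppresses EOS until the budget is reached, so every timed run produces the same number of tokens, and greedy decoding (`do_sample=False`) makes the runs deterministic.

```python
import time
from transformers import pipeline

# "gpt2" is a placeholder; substitute your fine-tuned checkpoint.
generator = pipeline("text-generation", model="gpt2")

prompt = "The quick brown fox"
start = time.perf_counter()
out = generator(
    prompt,
    max_new_tokens=50,   # upper bound on new tokens
    min_new_tokens=50,   # EOS is suppressed until this many tokens exist,
                         # so exactly 50 new tokens are generated
    do_sample=False,     # greedy decoding, reproducible across runs
)
elapsed = time.perf_counter() - start
print(f"{elapsed:.3f}s -> {out[0]['generated_text']!r}")
```

With the token count pinned, differences you measure between SageMaker, Colab, and Hugging Face reflect the hardware and serving stack rather than random stopping points.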