Hello @bala1802.
Do you mind sharing your testing scripts? Inference time can vary widely depending on the input and on the actual parameters used to generate the text.
1/ Did you use the `use_gpu` flag to actually use the GPU for inference?
I’m seeing a 6s inference time on my test string:

```
curl -X POST -d '{"inputs": "toto", "options": {"use_gpu": true, "use_cache": false}}' https://api-inference.huggingface.co/models/balawmt/LanguageModel_Trial_1 -H "Authorization: Bearer ${HF_API_TOKEN}" -D -
```
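If it's easier to script your timing runs, here is a minimal sketch of the same request in Python (plain `requests`, not an official client), assuming `HF_API_TOKEN` is set in your environment:

```python
import os
import time
import requests

API_URL = "https://api-inference.huggingface.co/models/balawmt/LanguageModel_Trial_1"
headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}

payload = {
    "inputs": "toto",
    # use_gpu to run on GPU, use_cache=False so every call is a real run
    "options": {"use_gpu": True, "use_cache": False},
}

start = time.time()
response = requests.post(API_URL, headers=headers, json=payload)
print(f"{time.time() - start:.2f}s", response.json())
```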
2/ First time vs. second time should not really make a difference. Are you trying 2 different payloads?
3/ The actual run time of a query on a `text-generation` pipeline can depend on the EOS token being generated randomly (otherwise it will simply generate `max_tokens`, which seems to be set to 500 for your model). So when trying to measure inference time, you need to make sure that you are generating the same number of tokens and that EOS cannot be generated, e.g. as in the sketch below.
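For instance, a local benchmark along these lines keeps the generated length fixed (a sketch, assuming a `transformers` version recent enough to support `min_new_tokens`; setting it equal to `max_new_tokens` prevents EOS from ending generation early):

```python
import time
from transformers import pipeline

generator = pipeline("text-generation", model="balawmt/LanguageModel_Trial_1")

start = time.time()
out = generator(
    "toto",
    min_new_tokens=128,  # forbid EOS before 128 new tokens
    max_new_tokens=128,  # and stop exactly there
    do_sample=False,     # greedy decoding for reproducible runs
)
print(f"{time.time() - start:.2f}s", out[0]["generated_text"][:80])
```

That way every run produces the same number of tokens and the timings are comparable.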
Hope that helps.