Hi Diego
When you use a custom inference script, you are leveraging the SageMaker Hugging Face Inference Toolkit. The cool part is that this toolkit actually uses the pipelines API in the background, see here.
What that means for you is that you don't have to write an inference script at all. Instead, you can pass additional parameters when calling the endpoint, like so (this example is for text generation, but the same principle applies in your case):
import json

import boto3
import streamlit as st

sagemaker_runtime = boto3.client("sagemaker-runtime")

prompt = st.text_area("Enter your prompt here:")

# temp, rep_penalty, and endpoint_name are assumed to be defined elsewhere in your app
params = {
    "return_full_text": True,
    "temperature": temp,
    "min_length": 50,
    "max_length": 100,
    "do_sample": True,
    "repetition_penalty": rep_penalty,
    "top_k": 20,
}

payload = {"inputs": prompt, "parameters": params}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)
Try it out: call the endpoint with the chunk_length_s
parameter in the same way, and that should work.
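As a minimal sketch of what that could look like for your case: the payload follows the same "inputs" / "parameters" pattern, with chunk_length_s passed under "parameters". Note that the audio placeholder and the chunk length value here are just illustrative assumptions; adapt "inputs" to however your endpoint expects the audio (raw bytes, base64 string, or an S3 URI, depending on your serializer).

```python
import json

# Hypothetical placeholder for the audio input -- replace with whatever
# format your endpoint's serializer expects.
audio_input = "s3://my-bucket/recording.wav"

# Same payload pattern as the text-generation example, but with the
# pipeline's chunk_length_s parameter (here: 30-second chunks).
payload = {
    "inputs": audio_input,
    "parameters": {"chunk_length_s": 30},
}

body = json.dumps(payload)

# body would then be passed as Body=... to sagemaker_runtime.invoke_endpoint,
# exactly as in the example above.
print(body)
```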
Hope that helps!
Cheers
Heiko