Hi Simon, I’ve done exactly what you suggested. The endpoint deploys, but I keep getting this persistent error:
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Here is the code I ran (I cleared out my API key before posting):
from sagemaker.huggingface import HuggingFaceModel

model_name = "CogniXpert"

# Container environment: model ID, task, and bitsandbytes quantization
hub = {
    'HF_MODEL_ID': 'I00N/CogniXpert',
    'HF_TASK': 'text-generation',
    'HF_API_TOKEN': "",  # cleared before posting
    'HF_MODEL_QUANTIZE': 'bitsandbytes',
}

model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,            # execution role, defined earlier in my notebook
    image_uri=image_uri,  # container image URI, defined earlier in my notebook
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name=model_name,
)