For those training or deploying HF models on Google Cloud GKE - what’s your experience been like? As a user, I browse Models, pick the one I want to fine-tune or deploy, and then it’s a personal journey of figuring the rest out on my own. So far (taking inference as the example) it’s been:

  1. Download model artifacts manually from HF Model hub
  2. Package a base serving image (like TF Serving or TGI) with the model artifacts and run it locally
  3. Once happy with serving results, upload to Artifact Registry
  4. Spin up a GKE cluster if one isn’t available, write Deployment and Service manifests, and deploy the image (Cloud Run is an enticing alternative)
  5. Use the Service’s external IP to serve requests
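For steps 4–5, the manifests can be fairly minimal. A sketch of the Deployment + Service pair (names, image path, and ports are placeholders; the GPU limit only applies if your node pool actually has GPUs):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-server        # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-server
  template:
    metadata:
      labels:
        app: tgi-server
    spec:
      containers:
        - name: tgi
          image: us-docker.pkg.dev/PROJECT/REPO/tgi-with-model:latest  # your Artifact Registry image
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1   # drop this for CPU-only serving
---
apiVersion: v1
kind: Service
metadata:
  name: tgi-server
spec:
  type: LoadBalancer      # provisions the external IP used in step 5
  selector:
    app: tgi-server
  ports:
    - port: 80
      targetPort: 8080
```

`type: LoadBalancer` is what hands you the external IP; for anything beyond a demo you’d likely front it with an Ingress or Gateway instead.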

Some models have guidance on training / deploying for inference so that helps a bit. I’d like to learn from others here:

  1. How do you decide which model to use for your use case, given the pace of releases (adjacent question, but curious to know 🙂)?
  2. What does your stack look like for train / deploy on GKE?
  3. Any major pain points around model discovery / fine-tuning / serving?

We’ve been using Google Cloud Batch to spawn GPU machines (on spot instances if you want them cheaper) and run training there. For serving, we’ve used Cloud Run so far (CPU-only inference), since it scales so easily. Interested in seeing what’s done elsewhere!
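For reference, a sketch of what that kind of Batch job config can look like (machine type, accelerator, and image URI are placeholders, and the field names are from memory, so double-check them against the Batch API reference):

```json
{
  "taskGroups": [{
    "taskCount": 1,
    "taskSpec": {
      "runnables": [{
        "container": {
          "imageUri": "us-docker.pkg.dev/PROJECT/REPO/trainer:latest",
          "commands": ["python", "train.py"]
        }
      }],
      "maxRetryCount": 3
    }
  }],
  "allocationPolicy": {
    "instances": [{
      "installGpuDrivers": true,
      "policy": {
        "machineType": "n1-standard-8",
        "provisioningModel": "SPOT",
        "accelerators": [{"type": "nvidia-tesla-t4", "count": 1}]
      }
    }]
  },
  "logsPolicy": {"destination": "CLOUD_LOGGING"}
}
```

`"provisioningModel": "SPOT"` is what gets you the cheaper preemptible capacity, and `maxRetryCount` lets Batch rerun the task after a preemption.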

Interesting. Hadn’t thought of Batch for training jobs. I like the idea of spot instances if checkpoints are saved continually, but that’s a cost vs. reliability tradeoff, I believe. Thanks for sharing. And yes, Cloud Run for inference makes sense.
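On the checkpointing point: what makes spot preemption cheap is saving state frequently and atomically, so a kill mid-write never corrupts the latest checkpoint. A minimal Python sketch (the file names and state format here are made up for illustration; a real HF setup would use `Trainer`’s `save_steps` / `resume_from_checkpoint` instead):

```python
import json
import os


def save_checkpoint(state, step, ckpt_dir, interval=100):
    """Save training state every `interval` steps, so a spot preemption
    loses at most `interval` steps of work. Returns the path written,
    or None if this step is not a checkpoint step."""
    if step % interval != 0:
        return None
    path = os.path.join(ckpt_dir, f"ckpt-{step}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, **state}, f)
    os.replace(tmp, path)  # atomic rename: never leaves a half-written file
    return path


def latest_checkpoint(ckpt_dir):
    """Find the highest-step checkpoint to resume from after a restart."""
    ckpts = [f for f in os.listdir(ckpt_dir) if f.startswith("ckpt-") and f.endswith(".json")]
    if not ckpts:
        return None
    return max(ckpts, key=lambda f: int(f.split("-")[1].split(".")[0]))
```

The shorter the interval, the less work a preemption costs you, at the price of more I/O (and storage, if you keep every checkpoint) - that’s the same tradeoff, just tunable.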