Scaling Mistral-7B on AWS SageMaker With Multiple Replica Endpoints


I’ve been replicating the workflow outlined in this blog post.

So far, I’ve successfully deployed multiple replicas of the Mistral model on ml.g5.12xlarge and ml.g5.24xlarge instances. I have one question about the MAX_BATCH_TOTAL_TOKENS value we set.

Does this parameter limit the number of tokens that can be processed in parallel:

  • per replica we create, or
  • across all the replicas we create?
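For context, here is roughly how I pass the parameter, sketched as a helper that only builds the container environment (the model ID and values are placeholders; the env var names follow the TGI container convention). My current understanding, which I'd like confirmed, is that TGI reads this variable per server process, so each replica would get its own independent token budget:

```python
def build_tgi_env(max_batch_total_tokens: int, num_gpus: int) -> dict:
    """Environment for one TGI container, i.e. one replica.

    Each replica runs its own TGI server, which reads
    MAX_BATCH_TOTAL_TOKENS from its own environment, so the limit
    appears to apply per replica rather than across the endpoint.
    """
    return {
        "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.1",  # placeholder model id
        "SM_NUM_GPUS": str(num_gpus),
        "MAX_BATCH_TOTAL_TOKENS": str(max_batch_total_tokens),
    }

# This dict is what I pass as `env=` to the HuggingFaceModel before deploying.
env = build_tgi_env(max_batch_total_tokens=8192, num_gpus=4)
print(env["MAX_BATCH_TOTAL_TOKENS"])
```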