I’ve been replicating the workflow outlined in this blog post.
So far, I’ve successfully deployed multiple replicas of the Mistral model on both ml.g5.12xlarge and ml.g5.24xlarge instances. I have one question about the MAX_BATCH_TOTAL_TOKENS value we set.
Does this parameter limit the number of tokens that can be processed in parallel:
- per replica we create, or
- across all the replicas we create?
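For reference, here is roughly the shape of the container environment I’m setting — a minimal sketch using the standard TGI/SageMaker environment variable names; the model ID and the concrete numbers are illustrative assumptions, not my exact values:

```python
# Environment passed to the TGI serving container (SageMaker copies this
# environment into every replica it launches). Variable names follow the
# Hugging Face TGI container convention; the values below are placeholders.
tgi_env = {
    "HF_MODEL_ID": "mistralai/Mistral-7B-Instruct-v0.1",  # illustrative model choice
    "SM_NUM_GPUS": "4",                   # ml.g5.12xlarge exposes 4 A10G GPUs
    "MAX_INPUT_LENGTH": "2048",           # max prompt tokens per request
    "MAX_TOTAL_TOKENS": "4096",           # max prompt + generated tokens per request
    "MAX_BATCH_TOTAL_TOKENS": "8192",     # the parameter my question is about
}

# Sanity-check the ordering TGI expects between these limits.
assert int(tgi_env["MAX_INPUT_LENGTH"]) < int(tgi_env["MAX_TOTAL_TOKENS"])
assert int(tgi_env["MAX_TOTAL_TOKENS"]) <= int(tgi_env["MAX_BATCH_TOTAL_TOKENS"])
```

My question is whether that last value is a budget each container enforces for itself, or a combined budget across every replica behind the endpoint.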