Increase summarization speed of llama-2-7b-chat-hf

I’m currently working on a project that generates quick summaries of long articles/conversations.

I’m running llama-2-7b-chat-hf with 4-bit quantization on an A10 GPU instance.
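For context, the model load looks roughly like this (a sketch; the exact BitsAndBytesConfig settings below are a standard NF4 setup, not necessarily mine verbatim):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"

# Standard 4-bit NF4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the model on the available GPU(s)
)
```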

The method I’m using is map_reduce (option 2 on this page: https://python.langchain.com/docs/use_cases/summarization).
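(In map_reduce, each chunk is summarized independently in a "map" step, and the per-chunk summaries are then combined in a "reduce" step. The core of it is just this; `llm` stands for the LangChain-wrapped model shown in the code at the end of the post:)

```python
from langchain.chains.summarize import load_summarize_chain

# map: summarize every chunk independently (parallelizable in principle)
# reduce: collapse the per-chunk summaries into one final summary
chain = load_summarize_chain(llm, chain_type="map_reduce")
```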

Of everything I’ve tried, this is the only approach that produces decent summaries in a reasonable amount of time. However, with really long articles (10,000+ words) it takes ~6 minutes to return an output.

I tried running the same thing on an instance with 4 A10G GPUs, but it didn’t reduce the time by any noticeable amount.

Is there anything else I could be doing to speed this up?

For reference, here is the code I’m running in a SageMaker notebook:
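(Trimmed down for readability; the chunk sizes and generation settings are representative rather than exact. `model` and `tokenizer` are the 4-bit objects loaded above, and `long_text` is the article to summarize.)

```python
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain

# Wrap the 4-bit model in a text-generation pipeline so LangChain can call it
hf_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=256,
    repetition_penalty=1.1,
)
llm = HuggingFacePipeline(pipeline=hf_pipe)

# Chunk the article so each piece fits comfortably in the context window
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
docs = splitter.create_documents([long_text])

# map_reduce: per-chunk summaries first, then one combined final summary
chain = load_summarize_chain(llm, chain_type="map_reduce")
summary = chain.run(docs)
print(summary)
```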