I’m currently working on a project to produce quick summaries of long articles and conversations.
I’m running llama-2-7b-chat-hf with 4-bit quantization on an A10 GPU instance.
The method I’m using is map_reduce (option 2) from the LangChain summarization guide: https://python.langchain.com/docs/use_cases/summarization
Of everything I’ve tried, this is the only approach that produces decent summaries in a reasonable amount of time. However, with really long articles (10,000+ words) it takes ~6 minutes to produce an output.
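To show where the time is going, here is a toy sketch of the map_reduce pattern, with a stub standing in for the llama-2 call (the splitter and `summarize_chunk` are hypothetical simplifications, not LangChain's actual internals): there is one model call per chunk in the map step, plus one final reduce call.

```python
# Toy sketch of map_reduce summarization; a stub replaces the LLM call.

def split_into_chunks(text, chunk_size=2000):
    """Naive fixed-size splitter; LangChain's text splitters are smarter."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def summarize_chunk(chunk):
    """Stub for the per-chunk LLM call -- this is the expensive part."""
    return chunk[:100]  # pretend the first 100 chars are the summary

def map_reduce_summarize(text, chunk_size=2000):
    chunks = split_into_chunks(text, chunk_size)
    partials = [summarize_chunk(c) for c in chunks]  # map step: N model calls
    combined = " ".join(partials)
    return summarize_chunk(combined)                 # reduce step: 1 more call
```

A 10,000-word article is roughly 50k characters, so with 2,000-character chunks that's ~25 sequential model calls before the reduce step even starts; wall-clock time scales with the chunk count unless the map calls run in parallel.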
I tried running the same thing on an instance with 4 A10G GPUs, but it didn’t reduce the time by any noticeable amount.
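One likely reason multi-GPU didn't help: if the model was loaded once with something like `device_map="auto"`, it gets sharded across the GPUs rather than replicated, so the per-chunk calls still run one at a time. The map step is embarrassingly parallel, so the speedup would come from dispatching chunks concurrently to multiple model replicas (e.g. one per GPU). A toy sketch of that idea, again with a stub in place of the model call:

```python
from concurrent.futures import ThreadPoolExecutor

def summarize_chunk(chunk):
    # Stub for a per-chunk LLM call; in a real setup each worker would
    # hold its own model replica pinned to a different GPU.
    return chunk[:100]

def parallel_map_summaries(chunks, workers=4):
    # Dispatch the map step concurrently instead of one chunk at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(summarize_chunk, chunks))
```

With 4 replicas the map step could in principle approach a 4x speedup, at the cost of fitting one quantized copy of the model per GPU.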
Is there anything else I could be doing to speed this up?
For reference, here is the code I’m running in a SageMaker notebook: