I’m currently working on a project to produce quick summaries of long articles and conversations.
I’m running llama-2-7b-chat-hf with 4-bit quantization on an A10 GPU instance.
The method I’m using is map_reduce (option 2) from the LangChain summarization guide: https://python.langchain.com/docs/use_cases/summarization
Of everything I’ve tried, this is the only approach that produces decent summaries in a reasonable amount of time. However, with really long articles (10,000+ words) it takes ~6 minutes to produce an output.
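To show where the time is going, here is a toy sketch of the map_reduce pattern, with a stub standing in for the llama-2 call (the splitter and `summarize_chunk` are hypothetical simplifications, not LangChain's actual internals): there is one model call per chunk in the map step, plus one final reduce call.

```python
# Toy sketch of map_reduce summarization; a stub replaces the LLM call.

def split_into_chunks(text, chunk_size=2000):
    """Naive fixed-size splitter; LangChain's text splitters are smarter."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def summarize_chunk(chunk):
    """Stub for the per-chunk LLM call -- this is the expensive part."""
    return chunk[:100]  # pretend the first 100 chars are the summary

def map_reduce_summarize(text, chunk_size=2000):
    chunks = split_into_chunks(text, chunk_size)
    partials = [summarize_chunk(c) for c in chunks]  # map step: N model calls
    combined = " ".join(partials)
    return summarize_chunk(combined)                 # reduce step: 1 more call
```

A 10,000-word article is roughly 50k characters, so with 2,000-character chunks that's ~25 sequential model calls before the reduce step even starts; wall-clock time scales with the chunk count unless the map calls run in parallel.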
I tried running the same thing on an instance with 4 A10G GPUs, but it didn’t reduce the time by any noticeable amount.
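One likely reason multi-GPU didn't help: if the model was loaded once with something like `device_map="auto"`, it gets sharded across the GPUs rather than replicated, so the per-chunk calls still run one at a time. The map step is embarrassingly parallel, so the speedup would come from dispatching chunks concurrently to multiple model replicas (e.g. one per GPU). A toy sketch of that idea, again with a stub in place of the model call:

```python
from concurrent.futures import ThreadPoolExecutor

def summarize_chunk(chunk):
    # Stub for a per-chunk LLM call; in a real setup each worker would
    # hold its own model replica pinned to a different GPU.
    return chunk[:100]

def parallel_map_summaries(chunks, workers=4):
    # Dispatch the map step concurrently instead of one chunk at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(summarize_chunk, chunks))
```

With 4 replicas the map step could in principle approach a 4x speedup, at the cost of fitting one quantized copy of the model per GPU.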
Is there anything else I could be doing to speed this up?
For reference, here is the code I’m running in a SageMaker notebook: