Does Llama 2 need a Pro subscription?

I get the following error when trying to use the meta-llama/Llama-2-7b-hf model. I didn't find any pointers through web search, so I'm asking here. Can someone please help?

llm = HuggingFaceHub(repo_id="meta-llama/Llama-2-7b-hf", huggingfacehub_api_token=my_token, model_kwargs={"temperature": 0.5, "max_length": 512})

This generates the following error. If I swap the repo_id out for "google/flan-t5-base", the code runs fine.

raise ValueError(f"Error raised by inference API: {response['error']}")
ValueError: Error raised by inference API: Model requires a Pro subscription

Thank you for your help!

Yes, to use it with the Inference API you need a Pro subscription, since the model is too large (roughly 13 GB, above the 10 GB limit of the free API). Of course, you can run it locally without any error.
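For reference, here is a minimal local-inference sketch with transformers (not from this thread); it assumes you have been granted access to meta-llama/Llama-2-7b-hf on the Hub, are logged in with your token, have accelerate installed for device_map="auto", and have enough memory (~13 GB for the fp16 weights).

# Minimal sketch: run Llama 2 locally instead of through the Inference API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to roughly halve memory use
    device_map="auto",          # place weights on GPU/CPU automatically (needs accelerate)
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=True, temperature=0.5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))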

Thank you, @YaTharThShaRma999! Are there quantized versions I can use through the Inference API? I see models by TheBloke that are smaller than 10 GB, but it appears the Inference API is turned off for them.

Thank you so much for your help!

You can't use those, since they are built for different libraries: the GPTQ versions are for ExLlama and AutoGPTQ, while the GGML versions are for llama.cpp.

If you really want to use a Llama model and don't have a GPU, try out llama-cpp-python.
It uses at most around 4 GB of RAM for a 7B model (the average phone has about 6 GB of RAM).
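Here is a minimal sketch of CPU inference with llama-cpp-python; the model_path filename is only an example, so point it at whichever quantized GGML/GGUF file you have downloaded (e.g. from one of TheBloke's GGML repos).

# Minimal CPU inference sketch with llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-2-7b.ggmlv3.q4_K_M.bin",  # local quantized model file (example name)
    n_ctx=512,                                    # context window size
)

output = llm(
    "Q: What are llamas? A:",
    max_tokens=64,       # cap the length of the completion
    temperature=0.5,
    stop=["Q:"],         # stop before the model starts a new question
)
print(output["choices"][0]["text"])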

If you do have a GPU, use AutoGPTQ or something similar, which works with transformers as well.
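For example, a rough sketch with AutoGPTQ (the repo id TheBloke/Llama-2-7B-GPTQ is just an example, and the exact from_quantized arguments can differ between auto-gptq versions):

# GPU inference sketch with auto-gptq + transformers (assumes a CUDA GPU and that
# auto-gptq and transformers are installed).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/Llama-2-7B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",       # load the already-quantized weights onto the GPU
    use_safetensors=True,
)

inputs = tokenizer("Tell me about llamas.", return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))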

@YaTharThShaRma999 Hi, can you please provide some resources for what you mentioned?
I want to use llama-cpp-python, but the ctransformers library does not provide some functions, such as the ones in this code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "TheBloke/Llama-2-7B-GGML"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
)
model.config.use_cache = False

I am not able to use code like this with GGML models, so I do not understand which model we should pick for fine-tuning.
Is GGML not for fine-tuning?
If not, what are the uses of GGML?

Can you please answer or share some links? I am not able to figure this out.

GGML is for inference, but it is technically possible to train a new model from scratch with it.

Also, in your code you try, for some reason, to load a GGML model (which is already 4-bit quantized) in 4-bit again?

That is not possible. If you need a bit more information, check out the ctransformers docs; there is a minimal loading sketch at the end of this reply. (ctransformers is also just for inference and doesn't have everything transformers has.)

GGML models are used because they need very little RAM and give very fast inference on CPU.
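For completeness, a minimal ctransformers inference sketch; the repo id and model_file below are examples of TheBloke's GGML uploads, not anything specific from this thread.

# Minimal GGML inference sketch with ctransformers (pip install ctransformers).
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGML",
    model_file="llama-2-7b.ggmlv3.q4_K_M.bin",  # which quantized file in the repo to use
    model_type="llama",
)

print(llm("AI is going to", max_new_tokens=64, temperature=0.5))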

Hi, did you get any solution? Did you find any other model, or any way to use the API for free?