SmolAgents: Trying to run an Agent with a local model (Mistral)

I have the following code, but when I run it, I get the following error:

“Error in generating model output: local_llama_response() got an unexpected keyword argument ‘stop_sequences’”

I assume that I need to format the prompt in a specific way for the LLaMA model, but I’m not sure how to do it. Could you please assist me?

I’ve also tried passing the model directly to the CodeAgent constructor, but it didn’t work.
Code:

from llama_cpp import Llama
from smolagents import CodeAgent

# Load the LLaMA model
llm = Llama(model_path="../models/mistral-7b-instruct-v0.1.Q4_K_M.gguf", n_ctx=2048)

# Define a wrapper function to avoid 'stop_sequences' error
def local_llama_response(prompt):
    """Generate a response using the local LLaMA model."""
    response = llm(prompt, max_tokens=100, stop=["\n"])  # Stop passed correctly
    return response["choices"][0]["text"]

agent = CodeAgent(tools=[get_capture_channels_recording_paths, get_recording_server], model=local_llama_response)

while True:
    user_input = input("Type something (or 'bye' to exit): ").strip().lower()
    if user_input == "bye":
        print("Goodbye!")
        break

    #print(f"You said: {user_input}")
    agent.run(user_input)

Thanks!


It’s difficult to hook this up to llama.cpp, because llama.cpp isn’t officially supported on the smolagents side yet.

Separately, even TransformersModel may fail when max_tokens is too small, so that could be part of what you’re seeing as well.

Ok,

If running LLaMA 2 locally isn’t possible yet, are there any other models I can run locally on my machine?

Thanks!


Actually, it’s possible to do this without directly loading the GGUF yourself. The idea is to have Ollama load the GGUF and run it as a server, and then have smolagents access that server. There seems to be an example on GitHub below, and other people have probably written guides as well.

If you want to use GGUF, you might want to try setting up an Ollama server. Ollama itself is easier to use than Llama.cpp.
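
For example, once the Ollama server is running (ollama serve) and a model has been pulled (ollama pull mistral), smolagents can talk to it through LiteLLMModel. This is only a minimal sketch: the model name, the default port 11434, and the num_ctx value are assumptions based on the smolagents documentation, so adjust them for your setup.

# Minimal sketch, assuming `ollama pull mistral` has been run and the Ollama
# server is listening on its default port. Adjust names/ports for your setup.
from smolagents import CodeAgent, LiteLLMModel

model = LiteLLMModel(
    model_id="ollama_chat/mistral",      # model name as registered in Ollama (assumption)
    api_base="http://127.0.0.1:11434",   # default Ollama endpoint
    num_ctx=8192,                        # code agents need a fairly large context window
)

agent = CodeAgent(tools=[], model=model)  # add your own tools here
agent.run("What is 2 ** 10?")

The point of going through Ollama is that the GGUF quantization is handled on the server side, so smolagents only talks to a chat API instead of llama.cpp directly.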

Also, as I mentioned above, there is the option of using TransformersModel. That class literally loads the un-quantized model locally via Transformers. Many models can be used this way, but since they are pre-quantization weights, they are large. I think support for quantizing on load is currently in progress.
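
If you want to try that route, it would look roughly like the sketch below. The model id and the max_new_tokens value are only placeholders, and I’m assuming your smolagents version exposes TransformersModel; you also need enough memory for the unquantized weights.

# Minimal sketch, assuming enough memory for the unquantized weights.
from smolagents import CodeAgent, TransformersModel

model = TransformersModel(
    model_id="Qwen/Qwen2.5-Coder-7B-Instruct",  # any instruct model from the Hub (placeholder)
    max_new_tokens=2048,                        # too small a value tends to break the agent loop
    device_map="auto",                          # place the model on available GPU(s)/CPU
)

agent = CodeAgent(tools=[], model=model)  # add your own tools here
agent.run("What is 2 ** 10?")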
