Create an Assistant to be used via Python scripts

Hey there, I’m new to the community.

So far, I played around with HuggingChat and created an Assistant for a specific use case.

I want to be able to create such an Assistant and make it available via API calls. So basically:

  • Choose a model (my preference atm is “CohereForAI/c4ai-command-r-plus-08-2024”)
  • Give it initial instructions
  • Publish it

And then, probably via the “transformers” library, I want to be able to send messages to that model.

I’m happy to pay for the Pro subscription if that’s necessary. Can you please give me some pointers on how to achieve this?

Never mind, I just found out about Spaces, which seems to be the way to go. Experimenting!

Although this is a somewhat unprofessional answer (I’m not an expert on anything to begin with), I think this feature is the one closest to your use case.
Any well-known model on HF can be called with a reasonable chance of success. I heard that even with a free account you can make 300 requests per hour.
There are some large models you can only call with a Pro subscription.
Without an account, I heard it’s 1 request per hour…

Here is an example of actual use and a sample of a usable LLM; the models used in that Space are candidates that are actually available.

It is also a good idea to get code from an app with a similar use case that someone else has created, rather than building it from scratch.
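
Very roughly, calling a model through the hosted Inference API from Python looks something like this (just a sketch; the model name and prompt are only examples):

import os
from huggingface_hub import InferenceClient

# a token is needed for gated models and raises the request limit
client = InferenceClient("meta-llama/Meta-Llama-3.1-70B-Instruct",
                         token=os.environ.get("HF_TOKEN"))

print(client.text_generation("Explain what a Hugging Face Space is in one sentence.",
                             max_new_tokens=100))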


Thanks a lot for your detailed answer!

I tried to create a space and added this to app.py:

import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "CohereForAI/c4ai-command-r-plus-08-2024"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

interface = gr.Interface(fn=generate_response, inputs="text", outputs="text")
interface.launch()

After that, the Space rebuilds the container, but I get this error:

OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024.
401 Client Error. (Request ID: Root=1-66ee97d2-2830f72d701dd7eb35d0359f;5d578753-6b5d-40e5-b832-6dc565ed94e8)

Cannot access gated repo for url https://huggingface.co/CohereForAI/c4ai-command-r-plus-08-2024/resolve/main/config.json.
Access to model CohereForAI/c4ai-command-r-plus-08-2024 is restricted. You must have access to it and be authenticated to access it. Please log in.

I know how to create a token, but don’t know how I can assign one to this space. Can you help me with that?

I found the answer myself: Gated models need additional permissions to be used.

I just went to CohereForAI/c4ai-command-r-plus-08-2024 · Hugging Face, filled out the form and got access.

1 Like

I know how to create a token, but don’t know how I can assign one to this space. Can you help me with that?

You could put the token directly in your code, but that is dangerous because it would be fully visible to anyone, so Spaces provide a Secrets facility for environment variables.
After setting the token as a Secret, you can retrieve it as follows.
Be aware that normal (non-secret) environment variables are still fully visible.

import os
hf_token = os.environ.get("HF_TOKEN")
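
You can then pass that token when loading a gated model, for example (a minimal sketch using the model from your error message):

import os
from transformers import AutoModelForCausalLM, AutoTokenizer

hf_token = os.environ.get("HF_TOKEN")  # the Secret you set in the Space settings
model_id = "CohereForAI/c4ai-command-r-plus-08-2024"

# token= authenticates the download of the gated repo
tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_id, token=hf_token)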

Yes, I did that and it retrieves the secret properly from the system’s env vars.

Now my problem is which model to pick that fits into ZeroGPU.

CohereForAI/c4ai-command-r-plus-08-2024 is too big. I’d like to find something with similar capabilities because my initial instructions are quite complex.

Are you loading the model into your own Space’s VRAM?
If so, you can use BitsAndBytes (BNB) 4-bit quantisation, which needs only about a quarter of the VRAM, so you can fit a larger model.

My actual code

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantisation: the weights take roughly a quarter of the usual VRAM
nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)
# model_name and device are defined elsewhere in my app
text_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=nf4_config,
                                                  device_map=device, torch_dtype=torch.bfloat16).eval()

I don’t fully understand what you mean by that. The space is running on Hugging Face infrastructure. The code suggests that it’s using the space’s VRAM, but I’m not sure:

tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_id, token=hf_token)

def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
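
One quick way to check where the weights actually end up (just a sketch) would be to print the parameter device:

import torch

# without device_map= or .to("cuda"), from_pretrained keeps the weights on the CPU
print(next(model.parameters()).device)   # e.g. "cpu" or "cuda:0"
print(torch.cuda.is_available())         # whether the container sees a GPU at all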

I will try the 4-bit quantisation, thanks for the code example!

Models that are too big might be easier to use this way: all you need is HF, a token, and an internet connection, so it will even work from a smartphone.
But you can do a lot more if you load the model yourself, as in your code.
If you go through the API, the functionality is inevitably limited.

from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3.1-70B-Instruct", token=hf_token)

system_message = "You are a helpful assistant. Try your best to give the best response possible to the user."
# base_prompt and input_text are defined elsewhere in my app
user_message = f"{base_prompt}\nDescription: {input_text}"

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message}
]

response = client.chat_completion(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    max_tokens=1024,
    temperature=0.7,
    top_p=0.95,
    messages=messages,
)
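
The generated text should then be in the returned object (assuming the usual chat-completion response shape):

print(response.choices[0].message.content)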

I was able to find a great Space and duplicated it. The copy works, but I reached my usage limit pretty quickly while using the text box in the Space UI:

You have exceeded your GPU quota (60s requested vs. 40s left). Create a free account to get more usage quota.

I am already logged in and just subscribed to Pro. Is this normal?

I will try your suggestion with the InferenceClient now.

The InferenceClient is working perfectly and is exactly what I needed.

Do you know anything about rate limits for Pro accounts?

I am already logged in and just subscribed to Pro. Is this normal?

I was also mistaken about this until recently, but it seems that “sign in” in this case refers to signing in to the Space itself.
It seems to be different from your account login or a token. Still, if you change one line in app.py and one line in README.md, you can add that sign-in.
You can’t do that if it’s someone else’s Space, though…
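
If I remember right (this is only a sketch, please double-check the Spaces OAuth docs), the two changes look roughly like this:

# README.md (YAML front matter): enable "Sign in with Hugging Face" for the Space
#   hf_oauth: true

# app.py: add a login button to the Gradio UI
import gradio as gr

with gr.Blocks() as demo:
    gr.LoginButton()   # lets visitors sign in with their HF account
    # ... the rest of the existing UI ...

demo.launch()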

I heard that Pro accounts get 5x the quota, but I don’t know if this is true.

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.