Is this CUDA memory error on the Inference API coming from Hugging Face or Google Colab?


I am calling the HF Inference API using the code from this article:

When I use the widget HF created at the top of that page to enter a long prompt (about 1500 tokens), it works fine.

However, when I use the code at the bottom of the article in Google Colab with the same inputs (I had to add the definition of the `options` variable; it errored without it):

import json
import requests

API_TOKEN = ""  # set to your HF API token


def query(payload='', parameters=None, options={'use_cache': False}):
    API_URL = ""
    headers = {"Authorization": f"Bearer {API_TOKEN}"}
    body = {"inputs": payload, 'parameters': parameters, 'options': options}
    try:
        response = requests.request("POST", API_URL, headers=headers, data=json.dumps(body))
        response.raise_for_status()
    except requests.exceptions.HTTPError:
        err = response.json()['error']
        return "Error: " + (" ".join(err) if isinstance(err, list) else err)
    return response.json()[0]['generated_text']

parameters = {
    'max_new_tokens': 150,  # number of generated tokens
    'temperature': .3,      # controls the randomness of generations
    'end_sequence': "###"   # stopping sequence for generation
}

options = {'use_cache': False}

prompt = "MY BIG LONG PROMPT"  # few-shot prompt

data = query(prompt,parameters,options)

I get this error: “CUDA out of memory, try a smaller payload”
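To see where the message originates, I can inspect the raw HTTP response instead of only the parsed text. This is a hypothetical diagnostic helper (the `inspect_response` name is my own, not from the article): when the Inference API itself rejects a request, it returns a non-200 status with a JSON body containing an `"error"` key, whereas a failure on the Colab side would not produce that body.

```python
import requests

def inspect_response(response: requests.Response) -> str:
    """Summarize an HF Inference API response for debugging.

    A server-side error from the API arrives as a non-200 status with a
    JSON body like {"error": "..."}; anything else suggests the problem
    is elsewhere in the stack.
    """
    if response.status_code != 200:
        try:
            detail = response.json().get("error", response.text)
        except ValueError:
            # Body was not JSON at all
            detail = response.text
        return f"HTTP {response.status_code}: {detail}"
    return "OK"
```

Calling this on the `response` object inside `query` would show the status code alongside the error text, which at least tells me whether the message was produced by the API's own error body.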

I’m not exactly sure which layer of the stack this error is coming from. Does it have to be coming from the HF API, or could it originate on the Google side of the stack? I don’t want to subscribe to Colab Pro only to find out that wasn’t the problem.

Thank you!