Using facebook/incoder-1B on Inference API

I’m trying to use the facebook/incoder-1B model via the Inference API. The model has some interesting masking procedures, and I’m unable to replicate the results I get from the authors’ demo / Python code. I wonder if there is some tokenization magic happening in the Inference API that I can’t discern.

This prompt was constructed following the paper and the Python implementation:

print("Hello W<|mask:0|>!")<|mask:0|>

and the API returns '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'. The correct output should look like orld<|endofmask|>. I wonder if there is a tokenizer problem here? Any help would be appreciated, thank you!

Here is the exact code used for inference:

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = f"https://api-inference.huggingface.co/models/facebook/incoder-1B"
data = dict(inputs='print("Hello W<|mask:0|>!")<|mask:0|>', 
            options=dict(use_gpu=False, use_cache=False, wait_for_model=True),
            parameters=dict(return_full_text = False, temperature=0.5, top_p=0.95, do_sample=True))
web_response = requests.request(
    "POST", API_URL, headers=headers, data=json.dumps(data))
response = json.loads(web_response.content.decode("utf-8"))
print(response)
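
For comparison, the local generation I’m checking against looks roughly like this (a sketch rather than the exact demo code), and it does give an orld<|endofmask|>-style completion:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")

prompt = 'print("Hello W<|mask:0|>!")<|mask:0|>'
# the tokenizer's default add_special_tokens=True prepends <|endoftext|> to the ids
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=16,
                            do_sample=True, temperature=0.5, top_p=0.95)
# decode only the newly generated tokens (the infill)
print(tokenizer.decode(output[0, inputs["input_ids"].shape[1]:]))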

This seems to be because the Inference API uses a text-generation pipeline, which doesn’t add the special <|endoftext|> token that should be at the start of every prompt (see transformers/text_generation.py at 215e0681e4c3f6ade6e219d022a5e640b42fcb76 · huggingface/transformers · GitHub), while the tokenizer we’re using adds it by default unless add_special_tokens is set to False.
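
To see the difference, you can compare the default tokenization against add_special_tokens=False (a quick check assuming the hub tokenizer loads as usual; the exact ids depend on the vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/incoder-1B")
prompt = 'print("Hello W<|mask:0|>!")<|mask:0|>'

default_ids = tok(prompt)["input_ids"]                         # add_special_tokens=True by default
bare_ids = tok(prompt, add_special_tokens=False)["input_ids"]  # roughly what the pipeline sends
# the default encoding should start with <|endoftext|>; the bare one should not
print(tok.convert_ids_to_tokens(default_ids[:1]))
print(tok.convert_ids_to_tokens(bare_ids[:1]))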

I’m not sure whether that’s intended behavior for the pipeline or not, but as a workaround you should be able to pass prefix='<|endoftext|>' in parameters, e.g.

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = f"https://api-inference.huggingface.co/models/facebook/incoder-1B"
data = dict(inputs='print("Hello W<|mask:0|>!")<|mask:0|>', 
            options=dict(use_gpu=False, use_cache=False, wait_for_model=True),
            parameters=dict(return_full_text = False, temperature=0.5, top_p=0.95, do_sample=True, prefix="<|endoftext|>"))
web_response = requests.request(
    "POST", API_URL, headers=headers, data=json.dumps(data))
response = json.loads(web_response.content.decode("utf-8"))
print(response)

I should also say that this example will produce some unexpected text like “Hello Who Are You” rather than “Hello World”, because “Hello World” is probably a single complete token in our vocabulary (we trained the tokenizer to allow multi-word tokens, and this is likely a common enough phrase for it to learn a single token), so the token ids for “Hello W” are probably not a prefix of the token ids for “Hello World”. You’ll likely get the best results when infilling entire lines, or multiple lines, since we didn’t allow token merges across newlines.
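
As a rough illustration (the whole-line prompt below is just an example), you can compare the two tokenizations and try masking a full line instead:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/incoder-1B")

partial_ids = tok('print("Hello W', add_special_tokens=False)["input_ids"]
full_ids = tok('print("Hello World', add_special_tokens=False)["input_ids"]
# if partial_ids is not a prefix of full_ids, the model cannot complete the word
# just by continuing the prompt's tokens, so mid-word infills are unreliable
print(partial_ids)
print(full_ids)

# masking an entire line usually behaves better, since tokens never merge across newlines
line_prompt = 'def greet(name):\n<|mask:0|>\n\ngreet("World")\n<|mask:0|>'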
