Hi, I created a custom handler that loads one of the Llama 2 models, combines it with PEFT weights, and was hoping I could deploy the resulting model to an inference endpoint. However, I ran into permission issues, even though I have access to Llama 2 (which is gated). I'm wondering if I'm missing something and would appreciate any help with this. I browsed through the docs but couldn't find anything specific to this scenario. Thanks in advance!
First question is: did you request access to the Llama 2 model from Meta? And if so, did you do it with the same email as your Hugging Face account? Once Meta grants access, it should let you pass the gate within HF.
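And even once access is granted, anything pulling the gated weights still needs to be authenticated with your HF token - e.g. `huggingface-cli login` on the command line, or in Python (a minimal sketch, assuming `huggingface_hub` is installed):

```python
# Quick local check that your token is set up before trying to pull gated weights
from huggingface_hub import login

login()  # prompts for your HF access token; or pass token="hf_..."
```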
Beyond that, can you share any of the code and/or errors you're getting? Are you trying to deploy a Hugging Face Endpoint, a SageMaker Endpoint, or something else?
Hi - just tagging onto this thread as I think it's relevant. Relative newbie, so I hope that's OK etiquette-wise. I am encountering the same or a similar issue trying to deploy a fine-tuned model to an endpoint. I am definitely authorised for the Llama 2 model and can access it on HF with no problem. Also, I have no issue using AutoTrain on top of the base Llama-2-chat model, again making me think it's not a problem with my user.
When I run AutoTrain I am running with push-to-hub but not merging, so the adapter stays separate. All works fine end to end, but I get a gated-repo error when I run the endpoint (alfraser/01-all-products-llama2-chat-mkv, in case it matters). I can't see anywhere to enter my credentials in the endpoint config.
I have been able to train and merge the model locally, but the hub push seems to fail consistently, I guess because it's so big (around 20 GB). Also, frankly, I'd rather keep the adapter separate if I can, because it will speed up my iterations on the hub push, and I think it should work.
Any guidance or thoughts gratefully received. Thanks! Al
Hi - just for posterity I wanted to update my query here. I have been in touch with Hugging Face support (thanks Megan!!). At this time, this is not possible. There is currently no support for secrets management on API endpoints, so there is nowhere to safely store the token that would be required to dynamically access the gated base model repo and merge the adapter at runtime. Therefore the only option is to merge the adapter during the training process and push the whole model. The downside is having to upload very large files, but it's the only way right now. Secrets management on API endpoints is on the HF roadmap.
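In case it helps anyone else who lands here, the merge-and-push step looks roughly like this (a sketch, not my exact code; the adapter and target repo IDs are placeholders for your own):

```python
# Sketch of merging a LoRA/PEFT adapter into the base model and pushing the
# merged weights to the Hub. The adapter/target repo IDs are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-chat-hf"       # gated base (needs access + token locally)
adapter_id = "your-username/your-lora-adapter"  # hypothetical adapter repo
merged_id = "your-username/your-merged-model"   # hypothetical target repo

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_id)

# Fold the adapter weights into the base model so the result is a plain
# transformers model with no PEFT dependency at inference time
merged = model.merge_and_unload()

merged.push_to_hub(merged_id)
AutoTokenizer.from_pretrained(base_id).push_to_hub(merged_id)
```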
Thanks @alfraser for leaving this note. That's the same suggestion their support gave me when I reached out a while ago, but instead of pushing the whole merged model I'm pushing the pre-trained base model + the adapter/LoRA weights, and combining them in a custom handler at inference time.
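Roughly, the handler looks like the sketch below - not our exact code. The `base`/`adapter` subfolder layout is just how we arranged our repo, and the base weights have to live in your own repo, since the endpoint can't authenticate against the gated upstream one:

```python
# handler.py at the root of the model repo; Inference Endpoints pick this up
# automatically. Sketch only: the "base"/"adapter" subfolders are our own
# convention - both sets of weights live in this repo, so no token is needed
# at runtime to reach the gated upstream repo.
from typing import Any, Dict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` is the local directory the endpoint downloaded the repo into
        self.tokenizer = AutoTokenizer.from_pretrained(f"{path}/base")
        base = AutoModelForCausalLM.from_pretrained(
            f"{path}/base", torch_dtype=torch.float16, device_map="auto"
        )
        # Apply the LoRA/PEFT weights on top of the base model
        self.model = PeftModel.from_pretrained(base, f"{path}/adapter")
        self.model.eval()

    def __call__(self, data: Dict[str, Any]) -> Dict[str, str]:
        prompt = data["inputs"]
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=256)
        text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return {"generated_text": text}
```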
Hi @monarchmoney-eng - thanks for the follow-up and the sketch. Have you got a bit more detail on how you have done this, or maybe an example repo that I could look at? I'm keen to try anything that will speed up my workflow and let me iterate more quickly.
In case it matters, I am using HF AutoTrain locally to train the model. And just to confirm - your approach is to add the custom handler.py to the model repo, and the endpoint picks it up automatically when it executes?
Appreciate any pointers