Hey there,
I am quantizing a model to 4-bit using bitsandbytes, and when I try to push the model to the Hub I get the following error:
You are calling `save_pretrained` on a 4-bit converted model. This is currently not supported
I’ve pushed a 4-bit converted model to the Hub before, after finetuning it with PEFT, but I was wondering whether I can do it without going through that path.
I know GPTQ gives me more options, but it only works for text models. Alternatively, I could go through finetuning and use methods like pruning or distillation to make the model smaller, but I am wondering whether there is a workaround for pushing a 4-bit converted model to the Hub.
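For reference, here is a minimal sketch of the kind of code that hits this error. The model id and repo name below are placeholders, and the config uses the standard `BitsAndBytesConfig` 4-bit options in transformers:

```
import transformers

# load a causal LM in 4-bit via bitsandbytes
quant_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",  # placeholder model id
    quantization_config=quant_config,
)

# push_to_hub serializes via save_pretrained, so without 4-bit
# serialization support this raises:
# "You are calling `save_pretrained` on a 4-bit converted model. This is currently not supported"
model.push_to_hub("my-user/my-4bit-model")  # hypothetical repo id
```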
The following PRs handle this issue:
TimDettmers/bitsandbytes PR #753 (TimDettmers:main ← poedator:save4), opened 07 Sep 2023 UTC:
Purpose: enable saving and loading transformers models in 4bit formats.
Enables this PR in transformers: https://github.com/huggingface/transformers/pull/26037
Addresses feature request #603 and other similar ones elsewhere.
Tested with Bloom-560 and Llama-2-7b: quantization, saving, and loading; matched tensors and quant_states; matched inference results.
Tested both nf4 and fp4; test added.
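As a quick illustration of the two 4-bit data types the PR tests, both are selectable through the quantization config in transformers (a minimal sketch, assuming the standard `BitsAndBytesConfig` API):

```
from transformers import BitsAndBytesConfig

# nf4 and fp4 are the two 4-bit quant types exercised in this PR
nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
fp4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="fp4")
```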
huggingface/transformers PR #26037 (huggingface:main ← poedator:save4), opened 07 Sep 2023 UTC:
## What does this PR do?
Purpose: enable saving and loading transformers models in 4bit formats.
Tested with Bloom-560 and Llama-2-7b: save, load, match tensors and quant_states, match inference results.
## connection with bitsandbytes
Requires this PR in bitsandbytes: https://github.com/TimDettmers/bitsandbytes/pull/753 to be able to save/load models
## testing:
The functionality was tested with this series of commands:
```
import transformers

quant_config = transformers.BitsAndBytesConfig(load_in_4bit=True)  # 4-bit quant config
model = transformers.AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", quantization_config=quant_config)
SAVE_PATH = "./saved_4bit_model"  # placeholder path
model.save_pretrained(SAVE_PATH, safe_serialization=True)  # tested with both False and True
model2 = transformers.AutoModelForCausalLM.from_pretrained(SAVE_PATH)
# then matching all params and quant_state items between the models, matching inference results
```
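As a rough sketch of what the matching step could look like (the PR's actual tests may differ; this just compares parameters directly with `torch.equal`):

```
import torch

# compare all parameters between the original and the reloaded model
for (n1, p1), (n2, p2) in zip(model.named_parameters(), model2.named_parameters()):
    assert n1 == n2 and torch.equal(p1.data, p2.data), f"mismatch in {n1}"
```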
Specific tests will be added to this PR once the [bitsandbytes PR](https://github.com/TimDettmers/bitsandbytes/pull/753) merges.
## Open questions to the maintainers:
1. I added and tested the code necessary for straightforward save/load operations with LLMs, yet there may be other kinds of function calls that need updating to handle 4-bit save/load - please suggest if/where to expand this PR.
Specifically: this PR covers `load_state_dict_into_meta_model()` but not `load_state_dict_into_model()`, because I did not find an example that uses it.
2. Some of my edits may not fit the maintainers' style or refactoring plans - please give guidance if needed.
I will just wait until they are merged.
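Once both PRs land in transformers and bitsandbytes releases, the original failing call should work directly; a hedged sketch:

```
# with 4-bit serialization support merged, saving and pushing should just work
model.save_pretrained("./my-4bit-model")    # placeholder local path
model.push_to_hub("my-user/my-4bit-model")  # hypothetical repo id
```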