Can someone explain how to fix a problem I am facing with safetensors saving? It seems something is done to base_model in the original model that I need to replicate. The error can be reproduced with the following:
from transformers import AutoModelForAudioClassification
from safetensors.torch import save_file

# Placeholder labels, just so the snippet runs on its own
id2label = {0: "a", 1: "b"}
label2id = {"a": 0, "b": 1}
num_labels = len(id2label)

model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
)
print('Save 1')
save_file(model.state_dict(), 'temp')
print('Save 1 Complete')
# Re-assigning base_model to itself is what triggers the failure on the next save
model.base_model = model.base_model
print('Save 2')
save_file(model.state_dict(), 'temp')
print('Save 2 Complete')
This outputs:
Save 1
Save 1 Complete
Save 2
Traceback (most recent call last):
.....
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'wav2vec2.masked_spec_embed', 'base_model.masked_spec_embed'},
.....
'base_model.encoder.layers.11.final_layer_norm.bias'}].
A potential way to correctly save your model is to use `save_model`.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
My only guess is that there is something being flagged with base_model to tell safetensors to ignore it, but when I reset the variable that flag is getting deleted.
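To make the sharing concrete: after the re-assignment, the duplicated keys the error lists really do point at the same storage, and the `save_model` helper the error suggests is documented to de-duplicate shared tensors before writing. A minimal sketch, continuing from the snippet above:

from safetensors.torch import save_model
# After the re-assignment, state_dict() contains a second set of keys under the
# 'base_model.' prefix that share storage with the 'wav2vec2.' entries:
sd = model.state_dict()
print(sd['wav2vec2.masked_spec_embed'].data_ptr() == sd['base_model.masked_spec_embed'].data_ptr())  # True
# save_model drops the shared duplicates before serializing, so it succeeds where save_file errors:
save_model(model, 'temp')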
This is the easiest way to get around it, but it’s just a workaround.
Quoted GitHub issue (opened 12:55AM - 24 Apr 24 UTC, closed 08:51AM - 24 Apr 24 UTC, labeled "solved"):
### Reminder
- [X] I have read the README and searched the existing issues.
### Reproduction
WANDB_DISABLED=1 NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 deepspeed --num_gpus 2 --master_port=9527 /workspace/projects/LLaMA-Factory/src/train_bash.py \
--stage rm \
--do_train \
--deepspeed xxxxxxxx/ds_z3_offload_config.json \
--model_name_or_path xxxxxxx/chatglm3-6b \
--adapter_name_or_path /xxx/chatglm_exp_sft_lora_llamafactory \
--create_new_adapter \
--dataset comparison_gpt4_zh \
--dataset_dir xxx/data \
--template chatglm3 \
--finetuning_type lora \
--lora_target query_key_value \
--output_dir xxx/chatglm_exp_rm_lora_llamafactory \
--overwrite_cache \
--overwrite_output_dir \
--cutoff_len 1024 \
--preprocessing_num_workers 4 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--warmup_steps 20 \
--save_steps 10 \
--eval_steps 20 \
--evaluation_strategy steps \
--learning_rate 1e-5 \
--num_train_epochs 2.0 \
--max_samples 5000 \
--val_size 0.1 \
--plot_loss \
--fp16
Training runs fine, but the following error is raised when saving a checkpoint:
[INFO|trainer.py:3305] 2024-04-23 16:56:46,579 >> Saving model checkpoint to /workspace/models/huggingface/chatglm32k_rm_sft_lora_llamafactory/checkpoint-10
[INFO|trainer.py:3319] 2024-04-23 16:56:46,587 >> Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Traceback (most recent call last):
File "/workspace/projects/LLaMA-Factory/src/train_bash.py", line 14, in <module>
main()
File "/workspace/projects/LLaMA-Factory/src/train_bash.py", line 5, in main
run_exp()
File "/workspace/projects/LLaMA-Factory/src/llmtuner/train/tuner.py", line 35, in run_exp
run_rm(model_args, data_args, training_args, finetuning_args, callbacks)
File "/workspace/projects/LLaMA-Factory/src/llmtuner/train/rm/workflow.py", line 50, in run_rm
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
return inner_training_loop(
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2278, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2673, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2752, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3239, in save_model
self._save(output_dir, state_dict=state_dict)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3321, in _save
safetensors.torch.save_file(
File "/opt/conda/lib/python3.10/site-packages/safetensors/torch.py", line 284, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
File "/opt/conda/lib/python3.10/site-packages/safetensors/torch.py", line 480, in _flatten
raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'pretrained_model.base_model.model.lm_head.weight', 'pretrained_model.base_model.model.transformer.output_layer.weight'}].
A potential way to correctly save your model is to use `save_model`.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
### Expected behavior
Able to save the RM checkpoints and complete training successfully.
### System Info
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- `transformers` version: 4.40.0
- Platform: Linux-5.15.0-101-generic-x86_64-with-glibc2.31
- Python version: 3.10.11
- Huggingface_hub version: 0.22.2
- Safetensors version: 0.4.3
- Accelerate version: 0.29.3
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
### Others
None
--save_safetensors False
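That flag maps onto the save_safetensors field of TrainingArguments, so it can also be set in code (a sketch, assuming a standard Trainer/HfArgumentParser setup):

from transformers import TrainingArguments
# save_safetensors=False makes Trainer fall back to torch.save when writing
# checkpoints instead of safetensors.
args = TrainingArguments(
    output_dir="out",  # placeholder
    save_safetensors=False,
)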
Thanks for the reply. That just causes Hugging Face to use torch.save instead of safetensors.save_file, correct? That has worked for me in other cases, but this model seems to have parametrized modules and gives the following error with torch.save:
RuntimeError: Serialization of parametrized modules is only supported through state_dict(). See:
https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-a-general-checkpoint-for-inference-and-or-resuming-training
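(If I'm reading the linked page right, the distinction it draws is between pickling the module object itself and saving its state dict; a quick illustration with the same model:)

import torch
from transformers import AutoModelForAudioClassification
model = AutoModelForAudioClassification.from_pretrained("facebook/wav2vec2-base")
# torch.save(model, "model_full.pt")        # pickling the parametrized module raises the error above
torch.save(model.state_dict(), "model.pt")  # going through state_dict() is the supported path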
Do you know of any methods other than this workaround?