Saving model in safetensors format through Trainer fails for Gemma 2 due to shared tensors

oran-sh · September 30, 2024, 8:04am

Hello,
I am finetuning google/gemma-2-2b and these are the arguments and trainer call:


from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", token=token, attn_implementation='eager')

training_args = TrainingArguments(
    output_dir=args.log_dir,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    warmup_steps=args.warmup_steps,
    learning_rate=args.learning_rate,
    evaluation_strategy="no",
    logging_dir=args.log_dir,
    logging_steps=50,
    save_strategy="steps",
    save_steps=2000,
    report_to="mlflow",
    run_name=args.run_name,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

I am getting the following error when trainer tries to save the model:

RuntimeError: 
            Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'text_model.model.embed_tokens.weight', 'text_model.lm_head.weight'}].
            A potential way to correctly save your model is to use `save_model`.

I have currently disabled saving as safetensors through the training arguments:
save_safetensors=False,
I would be happy to get your take on this and how to handle this issue.

Thanks!
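
For anyone hitting the same error, one way to see which parameters are tied is to group the state dict keys by memory address. The sketch below is only an illustration: `find_shared_tensors` is a made-up helper name, it relies on each value exposing `data_ptr()` (which `torch.Tensor` does), and it is a rougher check than the one safetensors itself performs.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping


def find_shared_tensors(state_dict: Mapping[str, Any]) -> List[List[str]]:
    """Group state-dict keys whose values start at the same memory address.

    Hypothetical helper. It only compares start addresses, so two views
    that begin at different offsets of one storage are not reported.
    """
    groups: Dict[int, List[str]] = defaultdict(list)
    for name, tensor in state_dict.items():
        ptr = tensor.data_ptr()
        if ptr == 0:
            # Assumption: empty tensors report address 0, which would
            # otherwise look like sharing between unrelated tensors.
            continue
        groups[ptr].append(name)
    # Keep only the addresses that more than one key points at.
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

With a real model this would be called as `find_shared_tensors(model.state_dict())`. For the case above one would expect it to report the `embed_tokens.weight` / `lm_head.weight` pair named in the error message, though that has not been checked against Gemma 2 here.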

John6666 · September 30, 2024, 8:13am

This looks like a possibly unresolved bug. The only workaround I could find is the one you already used, not saving in safetensors format, but that is not a real solution…
So this function seems to have been buggy since 2023…
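
If you want to keep the safetensors format, one possible direction is to drop the duplicate keys from the state dict before saving and re-tie them after loading. The following is a minimal sketch of that idea, not something tested against Trainer or Gemma 2: `drop_shared_aliases` is a hypothetical helper, and it assumes every value exposes `data_ptr()` the way `torch.Tensor` does.

```python
from typing import Any, Dict, Mapping, Tuple


def drop_shared_aliases(
    state_dict: Mapping[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Keep one key per shared memory address and record the dropped ones.

    Hypothetical helper. Returns (kept, aliases), where aliases maps each
    dropped key to the key that was kept in its place.
    """
    kept: Dict[str, Any] = {}
    aliases: Dict[str, str] = {}
    first_owner: Dict[int, str] = {}
    for name, tensor in state_dict.items():
        ptr = tensor.data_ptr()
        # Address 0 is treated as "no storage" and never as shared.
        owner = first_owner.setdefault(ptr, name) if ptr != 0 else name
        if owner == name:
            kept[name] = tensor
        else:
            aliases[name] = owner
    return kept, aliases
```

The `kept` dict could then be passed to `safetensors.torch.save_file`, and after loading, each key in `aliases` would have to be pointed back at its owner (for a tied LM head, the embedding weight). The error message itself points to `safetensors.torch.save_model`, which is probably the simpler route when you save the model yourself rather than through Trainer.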

github.com/hiyouga/LLaMA-Factory

Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'pretrained_model.base_model.model.lm_head.weight', 'pretrained_model.base_model.model.transformer.output_layer.weight'}].

opened 12:55AM - 24 Apr 24 UTC

closed 08:51AM - 24 Apr 24 UTC

zhangjiulong

solved

### Reminder
- [X] I have read the README and searched the existing issues.

### Reproduction
WANDB_DISABLED=1 NCCL_P2P_DISABLE=1 NCCL_IB_DISABLE=1 deepspeed --num_gpus 2 --master_port=9527 /workspace/projects/LLaMA-Factory/src/train_bash.py \
    --stage rm \
    --do_train \
    --deepspeed xxxxxxxx/ds_z3_offload_config.json \
    --model_name_or_path xxxxxxx/chatglm3-6b \
    --adapter_name_or_path /xxx/chatglm_exp_sft_lora_llamafactory \
    --create_new_adapter \
    --dataset comparison_gpt4_zh \
    --dataset_dir xxx/data \
    --template chatglm3 \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir xxx/chatglm_exp_rm_lora_llamafactory \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 4 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 10 \
    --eval_steps 20 \
    --evaluation_strategy steps \
    --learning_rate 1e-5 \
    --num_train_epochs 2.0 \
    --max_samples 5000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16

Training runs normally, but the following error appears when saving a checkpoint:

[INFO|trainer.py:3305] 2024-04-23 16:56:46,579 >> Saving model checkpoint to /workspace/models/huggingface/chatglm32k_rm_sft_lora_llamafactory/checkpoint-10
[INFO|trainer.py:3319] 2024-04-23 16:56:46,587 >> Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Traceback (most recent call last):
  File "/workspace/projects/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/projects/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/projects/LLaMA-Factory/src/llmtuner/train/tuner.py", line 35, in run_exp
    run_rm(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/workspace/projects/LLaMA-Factory/src/llmtuner/train/rm/workflow.py", line 50, in run_rm
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2278, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2673, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2752, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3239, in save_model
    self._save(output_dir, state_dict=state_dict)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 3321, in _save
    safetensors.torch.save_file(
  File "/opt/conda/lib/python3.10/site-packages/safetensors/torch.py", line 284, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/opt/conda/lib/python3.10/site-packages/safetensors/torch.py", line 480, in _flatten
    raise RuntimeError(
RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'pretrained_model.base_model.model.lm_head.weight', 'pretrained_model.base_model.model.transformer.output_layer.weight'}].
A potential way to correctly save your model is to use `save_model`. More information at https://huggingface.co/docs/safetensors/torch_shared_tensors

### Expected behavior
The RM checkpoints can be saved and training completes successfully.

### System Info
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- `transformers` version: 4.40.0
- Platform: Linux-5.15.0-101-generic-x86_64-with-glibc2.31
- Python version: 3.10.11
- Huggingface_hub version: 0.22.2
- Safetensors version: 0.4.3
- Accelerate version: 0.29.3
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

### Others
None

github.com/d8ahazard/sd_dreambooth_extension

[Bug]: On model training: RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again

opened 08:26AM - 05 Jun 23 UTC

closed 12:31AM - 30 Oct 23 UTC

AdrianStrugala

Stale

### Is there an existing issue for this?
- [X] I have searched the existing issues and checked the recent builds/commits of both this extension and the webui

### What happened?
A runtime exception pops up during model training and breaks the execution:
RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again.
More info in Console Logs

### Steps to reproduce the problem
1. Create a model based on realisticVisionV20_v20NoVAE
2. Go to Concepts, put Dataset Directory, instance and class prompts
3. Press 'Training Wizard: (Person)'
4. Press 'Train'

### Commit and libraries
Initializing Dreambooth
Dreambooth revision: b396af26b7906aa82a29d8847e756396cb2c28fb
Successfully installed accelerate-0.19.0 fastapi-0.94.1 gitpython-3.1.31 transformers-4.29.2
Does your project take forever to startup?
Repetitive dependency installation may be the reason.
Automatic1111's base project sets strict requirements on outdated dependencies.
If an extension is using a newer version, the dependency is uninstalled and reinstalled twice every startup.
[+] xformers version 0.0.17 installed.
[+] torch version 2.0.1+cu118 installed.
[+] torchvision version 0.15.2+cu118 installed.
[+] accelerate version 0.19.0 installed.
[+] diffusers version 0.16.1 installed.
[+] transformers version 4.29.2 installed.
[+] bitsandbytes version 0.35.4 installed.
Launching Web UI with arguments: --xformers
Loading weights [c0d1994c73] from D:\Workspace\Stable diffusion\stable-diffusion-webui\models\Stable-diffusion\realisticVisionV20_v20NoVAE.safetensors
Creating model from config: D:\Workspace\Stable diffusion\stable-diffusion-webui\configs\v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
### Command Line Arguments
```Shell
set COMMANDLINE_ARGS= --xformers
```

### Console logs
```Shell
Launching Web UI with arguments: --xformers
Loading weights [c0d1994c73] from D:\Workspace\Stable diffusion\stable-diffusion-webui\models\Stable-diffusion\realisticVisionV20_v20NoVAE.safetensors
Creating model from config: D:\Workspace\Stable diffusion\stable-diffusion-webui\configs\v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Textual inversion embeddings loaded(4): breasts, EasyNegative, small_tits, Style-Unshaved
Model loaded in 3.5s (load weights from disk: 0.2s, create model: 0.4s, apply weights to model: 0.7s, apply half(): 0.6s, move model to device: 0.6s, load textual inversion embeddings: 0.9s).
Applying optimization: xformers... done.
CUDA SETUP: Loading binary D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cudaall.dll...
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Startup time: 8.6s (import torch: 1.0s, import gradio: 0.8s, import ldm: 0.3s, other imports: 0.7s, load scripts: 4.9s, create ui: 0.7s, gradio launch: 0.1s).
Total images: 27
Largest prime: 3
Best factors: (3, 9)
Total VRAM: 12
Wizard results:<br>Num Epochs: 150<br>Num instance images per class image: 5
Exception loading config: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\dataclasses\db_config.py", line 411, in from_file
    config_dict = json.load(openfile)
  File "C:\Users\adist\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
  File "C:\Users\adist\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "C:\Users\adist\AppData\Local\Programs\Python\Python310\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\adist\AppData\Local\Programs\Python\Python310\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Duration: 00:00:00
Error completing request
Arguments: ('test.model', 'Native Diffusers') {}
Traceback (most recent call last):
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\utils\utils.py", line 200, in f
    res = func(*args, **kwargs)
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\ui_functions.py", line 683, in start_training
    if config.pretrained_vae_name_or_path == "":
AttributeError: 'NoneType' object has no attribute 'pretrained_vae_name_or_path'
Traceback (most recent call last):
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\gradio\routes.py", line 422, in run_predict
    output = await app.get_blocks().process_api(
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 1326, in process_api
    data = self.postprocess_data(fn_index, result["prediction"], state)
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 1229, in postprocess_data
    self.validate_outputs(fn_index, predictions) # type: ignore
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\gradio\blocks.py", line 1204, in validate_outputs
    raise ValueError(
ValueError: An event handler (f) didn't receive enough output values (needed: 5, received: 3).
Wanted outputs:
    [dropdown, html, html, gallery, html]
Received outputs:
    [None, "", "<div class='error'>AttributeError: 'NoneType' object has no attribute 'pretrained_vae_name_or_path'</div>"]
Wizard results:<br>Num Epochs: 150<br>Num instance images per class image: 5
Initializing dreambooth training...
Pre-processing images: classifiers_0: : 54it [00:00, 558.49it/s]
We need a total of 135 class images.: : 54it [00:00, 564.32it/s] | 0/27 [00:00<?, ?it/s]
Generating 135 class images for training...
Using scheduler: DEISMultistep
[repeated 40/40 sampling progress bars omitted]
Restored system models.
Generated 135 new class images.
Enabling xformers memory efficient attention for unet
Enabling xformers memory efficient attention for unet
Found 135 reg images.
Preparing dataset...
Init dataset!
Preparing Dataset (Without Caching)
Bucket 0 (512, 512, 0) - Instance Images: 27 | Class Images: 135 | Max Examples/batch: 54
Total Buckets 1 - Instance Images: 27 | Class Images: 135 | Max Examples/batch: 54
Total images / batch: 54, total examples: 54
Total dataset length (steps): 54
Initializing bucket counter!
Steps: 3%| | 270/8100 [05:33<1:13:52, 1.77it/s, inst_loss=0, loss=0.00184, lr=2e-6, prior_loss=0.00245, vram=10.3]
Exception saving sample.
Traceback (most recent call last):
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1032, in save_weights
    s_image = s_pipeline(
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\diffusers\pipelines\stable_diffusion\pipeline_stable_diffusion.py", line 645, in __call__
    prompt_embeds = self._encode_prompt(
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\diffusers\pipelines\stable_diffusion\pipeline_stable_diffusion.py", line 357, in _encode_prompt
    prompt_embeds = self.text_encoder(
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 816, in forward
    return self.text_model(
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 717, in forward
    causal_attention_mask = self._build_causal_attention_mask(
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\transformers\models\clip\modeling_clip.py", line 760, in _build_causal_attention_mask
    mask.triu_(1) # zero out the lower diagonal
RuntimeError: "triu_tril_cuda_template" not implemented for 'BFloat16'
Model name: test.model
Saving D:\Workspace\Stable diffusion\stable-diffusion-webui\models\dreambooth\test.model\logging\loss_plot_0.png
Saving D:\Workspace\Stable diffusion\stable-diffusion-webui\models\dreambooth\test.model\logging\ram_plot_0.png
Cleanup log parse.
Steps: 7%|▌ | 540/8100 [12:48<1:14:41, 1.69it/s, inst_loss=0, loss=0.12, lr=2e-6, prior_loss=0.16, vram=10.2]
Traceback (most recent call last):
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\ui_functions.py", line 729, in start_training
    result = main(class_gen_method=class_gen_method)
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1546, in main
    return inner_loop()
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\memory.py", line 119, in decorator
    return function(batch_size, grad_size, prof, *args, **kwargs)
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 1500, in inner_loop
    check_save(True)
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 794, in check_save
    save_weights(
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\extensions\sd_dreambooth_extension\dreambooth\train_dreambooth.py", line 970, in save_weights
    s_pipeline.save_pretrained(tmp_dir, safe_serialization=True)
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\diffusers\pipelines\pipeline_utils.py", line 607, in save_pretrained
    save_method(os.path.join(save_directory, pipeline_component_name), **save_kwargs)
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\diffusers\models\modeling_utils.py", line 319, in save_pretrained
    safetensors.torch.save_file(
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\safetensors\torch.py", line 232, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\lib\site-packages\safetensors\torch.py", line 394, in _flatten
    raise RuntimeError(
RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'decoder.conv_in.weight', 'encoder.down_blocks.2.resnets.0.conv1.bias', 'decoder.up_blocks.2.resnets.2.conv2.weight', 'encoder.down_blocks.1.resnets.1.conv2.weight', 'decoder.up_blocks.1.resnets.1.conv1.weight', 'encoder.conv_out.bias', 'encoder.down_blocks.0.resnets.1.norm2.weight', 'decoder.up_blocks.1.resnets.0.norm2.weight', 'decoder.up_blocks.2.resnets.0.norm1.weight', 'decoder.conv_in.bias', 'encoder.down_blocks.2.resnets.0.conv1.weight', 'decoder.up_blocks.2.resnets.0.conv2.weight', 'decoder.up_blocks.2.resnets.2.norm1.weight', 'decoder.up_blocks.3.resnets.1.conv2.weight', 'encoder.mid_block.attentions.0.group_norm.bias', 'decoder.up_blocks.0.resnets.0.norm1.bias', 'decoder.up_blocks.0.resnets.1.conv2.bias', 'decoder.up_blocks.1.resnets.0.conv1.weight', 'decoder.mid_block.resnets.0.norm1.weight', 'decoder.up_blocks.0.resnets.1.norm2.weight', 'encoder.down_blocks.1.resnets.1.norm2.weight', 'encoder.down_blocks.1.resnets.1.norm2.bias', 'encoder.down_blocks.2.resnets.0.conv2.bias', 'encoder.mid_block.resnets.0.conv1.bias', 'encoder.down_blocks.0.resnets.0.conv1.weight', 'decoder.up_blocks.0.resnets.2.conv2.bias', 'decoder.up_blocks.1.resnets.0.conv2.bias', 'decoder.up_blocks.2.upsamplers.0.conv.weight', 'decoder.up_blocks.0.resnets.2.conv2.weight', 'encoder.down_blocks.0.resnets.1.norm2.bias', 'decoder.up_blocks.3.resnets.2.norm1.weight',
'decoder.mid_block.resnets.1.norm2.weight', 'decoder.up_blocks.3.resnets.2.conv1.bias', 'decoder.mid_block.attentions.0.query.bias', 'encoder.down_blocks.1.resnets.0.conv2.weight', 'decoder.up_blocks.3.resnets.1.norm1.weight', 'decoder.up_blocks.0.resnets.1.norm1.bias', 'encoder.down_blocks.0.downsamplers.0.conv.bias', 'post_quant_conv.bias', 'encoder.mid_block.resnets.0.norm2.bias', 'decoder.up_blocks.0.resnets.2.norm1.weight', 'decoder.up_blocks.2.resnets.2.conv2.bias', 'decoder.mid_block.attentions.0.key.bias', 'decoder.up_blocks.3.resnets.0.conv2.bias', 'encoder.mid_block.attentions.0.proj_attn.weight', 'encoder.down_blocks.0.resnets.0.norm2.weight', 'encoder.mid_block.resnets.0.conv1.weight', 'encoder.mid_block.resnets.1.norm1.bias', 'encoder.down_blocks.0.resnets.1.norm1.weight', 'encoder.conv_norm_out.weight', 'decoder.up_blocks.2.resnets.1.conv1.weight', 'encoder.mid_block.resnets.1.norm1.weight', 'encoder.down_blocks.1.resnets.1.conv2.bias', 'decoder.up_blocks.1.resnets.2.conv2.weight', 'encoder.down_blocks.1.resnets.0.norm1.weight', 'encoder.down_blocks.2.resnets.1.conv1.weight', 'decoder.up_blocks.3.resnets.0.conv2.weight', 'encoder.down_blocks.1.downsamplers.0.conv.weight', 'encoder.down_blocks.2.resnets.0.conv_shortcut.weight', 'encoder.down_blocks.2.resnets.0.norm2.bias', 'encoder.mid_block.attentions.0.proj_attn.bias', 'decoder.up_blocks.0.resnets.0.conv2.weight', 'decoder.up_blocks.2.upsamplers.0.conv.bias', 'decoder.up_blocks.3.resnets.0.conv_shortcut.weight', 'decoder.mid_block.resnets.0.conv2.weight', 'encoder.down_blocks.3.resnets.0.conv2.bias', 'decoder.up_blocks.3.resnets.2.norm2.weight', 'encoder.down_blocks.1.resnets.0.conv2.bias', 'encoder.down_blocks.2.resnets.0.norm2.weight', 'encoder.down_blocks.2.downsamplers.0.conv.weight', 'encoder.mid_block.attentions.0.value.weight', 'decoder.up_blocks.2.resnets.0.conv1.bias', 'decoder.up_blocks.3.resnets.1.norm2.weight', 'encoder.down_blocks.3.resnets.1.conv1.bias', 
'encoder.down_blocks.1.resnets.0.conv_shortcut.weight', 'encoder.down_blocks.3.resnets.0.norm2.weight', 'encoder.mid_block.resnets.0.norm1.weight', 'decoder.up_blocks.1.resnets.0.norm2.bias', 'decoder.up_blocks.0.upsamplers.0.conv.bias', 'decoder.up_blocks.0.resnets.1.conv1.weight', 'decoder.up_blocks.3.resnets.0.conv1.weight', 'decoder.up_blocks.3.resnets.1.norm2.bias', 'encoder.down_blocks.0.resnets.0.norm2.bias', 'decoder.mid_block.attentions.0.value.bias', 'encoder.down_blocks.3.resnets.0.norm2.bias', 'decoder.up_blocks.0.resnets.0.conv1.weight', 'decoder.up_blocks.1.resnets.1.conv1.bias', 'decoder.up_blocks.1.resnets.2.norm2.weight', 'encoder.down_blocks.2.resnets.1.norm1.bias', 'encoder.down_blocks.1.resnets.1.norm1.weight', 'encoder.down_blocks.2.resnets.1.norm2.weight', 'encoder.down_blocks.1.downsamplers.0.conv.bias', 'encoder.conv_in.weight', 'decoder.up_blocks.2.resnets.0.conv_shortcut.bias', 'decoder.mid_block.attentions.0.query.weight', 'decoder.up_blocks.3.resnets.2.conv2.bias', 'quant_conv.bias', 'encoder.down_blocks.1.resnets.1.norm1.bias', 'decoder.up_blocks.2.resnets.1.norm2.bias', 'decoder.mid_block.resnets.0.norm2.weight', 'decoder.up_blocks.2.resnets.1.norm1.bias', 'encoder.conv_out.weight', 'decoder.up_blocks.2.resnets.0.conv1.weight', 'post_quant_conv.weight', 'encoder.down_blocks.3.resnets.1.norm2.bias', 'decoder.up_blocks.2.resnets.1.norm2.weight', 'decoder.mid_block.resnets.0.conv2.bias', 'encoder.down_blocks.1.resnets.1.conv1.bias', 'encoder.down_blocks.3.resnets.0.conv2.weight', 'decoder.up_blocks.3.resnets.1.norm1.bias', 'encoder.down_blocks.2.resnets.0.norm1.bias', 'decoder.up_blocks.1.resnets.2.conv1.weight', 'decoder.up_blocks.1.resnets.2.conv2.bias', 'encoder.down_blocks.0.resnets.0.conv2.bias', 'decoder.up_blocks.2.resnets.2.conv1.weight', 'encoder.down_blocks.0.downsamplers.0.conv.weight', 'encoder.down_blocks.0.resnets.0.norm1.bias', 'encoder.down_blocks.2.resnets.1.norm2.bias', 'decoder.up_blocks.3.resnets.2.conv2.weight', 
'decoder.mid_block.attentions.0.proj_attn.weight', 'encoder.down_blocks.0.resnets.1.conv2.bias', 'encoder.down_blocks.3.resnets.1.norm1.bias', 'decoder.conv_norm_out.bias', 'encoder.down_blocks.1.resnets.1.conv1.weight', 'decoder.mid_block.attentions.0.group_norm.bias', 'encoder.conv_norm_out.bias', 'encoder.down_blocks.3.resnets.1.norm1.weight', 'encoder.mid_block.resnets.1.norm2.weight', 'encoder.down_blocks.0.resnets.1.norm1.bias', 'encoder.down_blocks.0.resnets.1.conv2.weight', 'decoder.up_blocks.3.resnets.2.conv1.weight', 'encoder.mid_block.resnets.1.conv1.bias', 'decoder.mid_block.resnets.0.conv1.weight', 'decoder.conv_out.bias', 'decoder.up_blocks.0.resnets.2.conv1.weight', 'decoder.up_blocks.2.resnets.1.conv2.bias', 'encoder.mid_block.attentions.0.group_norm.weight', 'decoder.up_blocks.0.resnets.0.norm2.weight', 'decoder.conv_out.weight', 'encoder.down_blocks.0.resnets.1.conv1.weight', 'decoder.up_blocks.2.resnets.0.conv2.bias', 'decoder.up_blocks.1.resnets.2.conv1.bias', 'decoder.up_blocks.0.resnets.1.conv2.weight', 'encoder.down_blocks.1.resnets.0.norm2.bias', 'decoder.up_blocks.0.resnets.1.norm2.bias', 'decoder.up_blocks.3.resnets.0.norm2.weight', 'encoder.down_blocks.1.resnets.0.conv1.weight', 'encoder.mid_block.attentions.0.value.bias', 'decoder.up_blocks.0.resnets.0.norm2.bias', 'decoder.up_blocks.0.resnets.2.norm2.bias', 'decoder.up_blocks.2.resnets.2.norm2.bias', 'decoder.mid_block.attentions.0.value.weight', 'decoder.mid_block.resnets.1.conv1.bias', 'encoder.mid_block.attentions.0.query.bias', 'decoder.up_blocks.0.resnets.1.conv1.bias', 'decoder.up_blocks.3.resnets.1.conv2.bias', 'encoder.down_blocks.3.resnets.1.conv2.bias', 'decoder.up_blocks.1.resnets.0.norm1.weight', 'encoder.down_blocks.1.resnets.0.norm2.weight', 'decoder.up_blocks.2.resnets.0.norm2.bias', 'decoder.up_blocks.0.resnets.2.norm2.weight', 'decoder.up_blocks.1.resnets.1.norm1.bias', 'decoder.mid_block.resnets.1.norm2.bias', 'encoder.mid_block.attentions.0.query.weight', 
'decoder.up_blocks.0.resnets.0.conv1.bias', 'decoder.up_blocks.1.resnets.0.conv1.bias', 'encoder.down_blocks.0.resnets.1.conv1.bias', 'decoder.up_blocks.2.resnets.2.norm2.weight', 'decoder.up_blocks.1.upsamplers.0.conv.bias', 'decoder.mid_block.attentions.0.group_norm.weight', 'decoder.mid_block.attentions.0.proj_attn.bias', 'decoder.mid_block.attentions.0.key.weight', 'decoder.up_blocks.0.resnets.0.norm1.weight', 'decoder.up_blocks.2.resnets.0.conv_shortcut.weight', 'encoder.mid_block.attentions.0.key.bias', 'decoder.up_blocks.1.resnets.0.conv2.weight', 'decoder.conv_norm_out.weight', 'encoder.down_blocks.2.resnets.0.norm1.weight', 'encoder.down_blocks.1.resnets.0.conv1.bias', 'decoder.up_blocks.2.resnets.0.norm2.weight', 'decoder.up_blocks.3.resnets.0.conv1.bias', 'decoder.up_blocks.1.upsamplers.0.conv.weight', 'encoder.down_blocks.2.downsamplers.0.conv.bias', 'encoder.down_blocks.1.resnets.0.norm1.bias', 'encoder.mid_block.resnets.1.norm2.bias', 'decoder.up_blocks.3.resnets.1.conv1.weight', 'encoder.mid_block.resnets.0.conv2.weight', 'decoder.up_blocks.0.resnets.2.conv1.bias', 'encoder.down_blocks.3.resnets.0.norm1.weight', 'decoder.mid_block.resnets.0.norm2.bias', 'quant_conv.weight', 'decoder.mid_block.resnets.0.conv1.bias', 'decoder.up_blocks.0.resnets.1.norm1.weight', 'encoder.down_blocks.0.resnets.0.conv2.weight', 'decoder.up_blocks.1.resnets.1.norm1.weight', 'encoder.down_blocks.3.resnets.1.conv1.weight', 'encoder.mid_block.resnets.0.norm2.weight', 'decoder.up_blocks.1.resnets.1.norm2.weight', 'decoder.up_blocks.3.resnets.2.norm2.bias', 'decoder.up_blocks.1.resnets.2.norm1.bias', 'encoder.down_blocks.2.resnets.0.conv_shortcut.bias', 'encoder.mid_block.attentions.0.key.weight', 'decoder.up_blocks.1.resnets.2.norm2.bias', 'encoder.mid_block.resnets.1.conv1.weight', 'decoder.up_blocks.0.resnets.2.norm1.bias', 'decoder.up_blocks.2.resnets.0.norm1.bias', 'decoder.mid_block.resnets.0.norm1.bias', 'encoder.mid_block.resnets.1.conv2.weight', 
'encoder.mid_block.resnets.0.norm1.bias', 'decoder.up_blocks.2.resnets.1.norm1.weight', 'decoder.up_blocks.3.resnets.1.conv1.bias', 'decoder.up_blocks.2.resnets.1.conv2.weight', 'decoder.mid_block.resnets.1.norm1.weight', 'decoder.up_blocks.1.resnets.1.conv2.bias', 'encoder.down_blocks.3.resnets.0.conv1.bias', 'decoder.up_blocks.1.resnets.2.norm1.weight', 'encoder.down_blocks.3.resnets.0.norm1.bias', 'encoder.down_blocks.3.resnets.0.conv1.weight', 'decoder.mid_block.resnets.1.norm1.bias', 'encoder.down_blocks.2.resnets.1.conv2.bias', 'decoder.up_blocks.2.resnets.2.norm1.bias', 'decoder.up_blocks.3.resnets.0.norm1.bias', 'decoder.mid_block.resnets.1.conv2.bias', 'encoder.mid_block.resnets.0.conv2.bias', 'encoder.mid_block.resnets.1.conv2.bias', 'encoder.down_blocks.0.resnets.0.norm1.weight', 'encoder.down_blocks.3.resnets.1.conv2.weight', 'encoder.down_blocks.2.resnets.0.conv2.weight', 'decoder.up_blocks.0.upsamplers.0.conv.weight', 'decoder.up_blocks.2.resnets.1.conv1.bias', 'decoder.up_blocks.3.resnets.0.norm1.weight', 'decoder.up_blocks.3.resnets.0.norm2.bias', 'encoder.conv_in.bias', 'encoder.down_blocks.1.resnets.0.conv_shortcut.bias', 'decoder.up_blocks.2.resnets.2.conv1.bias', 'decoder.up_blocks.1.resnets.0.norm1.bias', 'decoder.up_blocks.0.resnets.0.conv2.bias', 'decoder.up_blocks.3.resnets.2.norm1.bias', 'decoder.mid_block.resnets.1.conv1.weight', 'decoder.up_blocks.1.resnets.1.conv2.weight', 'encoder.down_blocks.0.resnets.0.conv1.bias', 'encoder.down_blocks.2.resnets.1.conv2.weight', 'encoder.down_blocks.2.resnets.1.norm1.weight', 'decoder.up_blocks.3.resnets.0.conv_shortcut.bias', 'decoder.mid_block.resnets.1.conv2.weight', 'encoder.down_blocks.3.resnets.1.norm2.weight', 'encoder.down_blocks.2.resnets.1.conv1.bias', 'decoder.up_blocks.1.resnets.1.norm2.bias'}]. A potential way to correctly save your model is to use `save_model`. 
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
Steps: 7%|▌ | 540/8100 [12:49<2:59:32, 1.42s/it, inst_loss=0, loss=0.12, lr=2e-6, prior_loss=0.16, vram=10.2]
Saving weights/samples...: 0%| | 0/3 [00:01<?, ?it/s]
Restored system models.
Duration: 00:22:20
```

### Additional information

Windows 11, venv "D:\Workspace\Stable diffusion\stable-diffusion-webui\venv\Scripts\Python.exe" Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)] Version: v1.3.2
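
The list of names in a message like this can run to hundreds of entries, which makes it hard to see which modules are involved. Here is a minimal sketch for pulling the names out of the message and counting them per top-level module. It assumes the message keeps the `[{...}]` list-of-sets layout seen in these logs; the helper names `shared_tensor_groups` and `count_by_top_level` are hypothetical and not part of safetensors.

```python
import ast
import re
from collections import Counter


def shared_tensor_groups(message: str) -> list:
    """Extract the groups of tensor names from a shared-memory error message.

    Assumes the message contains one Python-style literal of the form
    [{'a', 'b'}, {'c', 'd'}], as in the logs quoted in this thread.
    """
    match = re.search(r"\[\{.*\}\]", message, flags=re.DOTALL)
    if match is None:
        return []
    # literal_eval parses the list-of-sets literal without executing code.
    return [set(group) for group in ast.literal_eval(match.group(0))]


def count_by_top_level(group: set) -> Counter:
    """Count names per top-level module, e.g. 'encoder' or 'decoder'."""
    return Counter(name.split(".", 1)[0] for name in group)


message = (
    "Some tensors share memory, this will lead to duplicate memory on disk "
    "and potential differences when loading them again: "
    "[{'text_model.model.embed_tokens.weight', 'text_model.lm_head.weight'}]. "
    "A potential way to correctly save your model is to use `save_model`."
)
groups = shared_tensor_groups(message)
print(len(groups), sorted(groups[0]))
print(count_by_top_level(groups[0]))
```

A single very large group, as in the log above, would indicate that nearly every tensor of that component aliases the same storage, which is a different situation from one tied pair of embedding weights.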

oran-sh · September 30, 2024, 8:16am

Indeed a bug. Should I post this on GitHub or wait for a response here first?

John6666 · September 30, 2024, 8:21am

I don’t think the HF library developers follow this forum closely. Posting on GitHub, or in the HF repo’s Discussions if there is one, would be preferable. I don’t have a GitHub account at the moment, so if possible, please do.
If we knew the exact maintainer, we could mention them with @username, but in this case we don’t know who is in charge.

oran-sh · September 30, 2024, 8:34am

I opened a bug in the transformers repo:

github.com/huggingface/transformers

Saving model in safetensors format through Trainer fails for Gemma 2 due to shared tensors

opened 08:33AM - 30 Sep 24 UTC

oranshayer

bug

### System Info

- `transformers` version: 4.44.2
- Platform: Linux-5.10.220-20…9.869.amzn2.x86_64-x86_64-with-glibc2.26
- Python version: 3.10.14
- Huggingface_hub version: 0.25.1
- Safetensors version: 0.4.5
- Accelerate version: 0.34.2
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.1+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>
- Using GPU in script?: <fill in>
- GPU type: NVIDIA A10G

### Who can help?

@muellerz @SunMarc

### Information

- [X] The official example scripts
- [X] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)

### Reproduction

I am finetuning `google/gemma-2-2b` and these are the arguments and trainer call:

```
text_model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", token=token, attn_implementation='eager')

training_args = TrainingArguments(
    output_dir=args.log_dir,
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    per_device_eval_batch_size=args.eval_batch_size,
    warmup_steps=args.warmup_steps,
    learning_rate=args.learning_rate,
    evaluation_strategy="no",
    logging_dir=args.log_dir,
    logging_steps=50,
    save_strategy="steps",
    save_steps=2000,
    report_to="mlflow",
    run_name=args.run_name,
)

trainer = Trainer(
    model=text_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
```

I am getting the following error when trainer tries to save the model:

```
RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'text_model.model.embed_tokens.weight', 'text_model.lm_head.weight'}].
A potential way to correctly save your model is to use `save_model`.
```

I have currently disabled saving as safetensors through the training arguments: `save_safetensors=False,`

### Expected behavior

Should save in safetensors without raising an error.

John6666 · September 30, 2024, 8:40am

Thank you.
