When fine-tuning Llama 2 with DeepSpeed and QLoRA on a single node with multiple GPUs, I use ZeRO-3 to partition the model parameters, but each GPU always loads the full set of parameters first and only partitions them right before training, instead of loading the parameters already partitioned. After checking the Hugging Face documentation, I found that `TrainingArguments` needs to be created before calling `from_pretrained`. I did that and the ZeRO-3 init indeed worked, but then a confusing problem arose: `NotImplementedError: Cannot copy out of meta tensor; no data!`
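For reference, this is the ordering the Hugging Face DeepSpeed docs describe, reduced to a minimal sketch (placeholder model id and arguments, not my actual script): creating `TrainingArguments` first is what lets `from_pretrained` detect the ZeRO-3 setup and initialize the weights already partitioned instead of materializing the full model on every GPU.
```python
# Minimal sketch of the ordering from the HF DeepSpeed docs (placeholder
# model id and arguments, not my real script). With a ZeRO-3 DeepSpeed
# config active (e.g. launched through `accelerate launch`), creating
# TrainingArguments BEFORE from_pretrained lets from_pretrained initialize
# parameters under deepspeed.zero.Init, i.e. already partitioned per GPU.
from transformers import AutoModelForCausalLM, TrainingArguments

training_args = TrainingArguments(output_dir="tmp", bf16=True)  # must come first

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf")
```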
Here is my code:
```python
import os

from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
import bitsandbytes as bnb
from peft import LoraConfig
from trl import SFTTrainer
from accelerate import Accelerator
import deepspeed

accelerator = Accelerator()
dataset = load_dataset("json", data_files="Belle_open_source_0.5M_changed.json", split="train")
result_dir = "tmp"
training_args = TrainingArguments(
    report_to="none",
    output_dir=result_dir,
    # per_device_train_batch_size * gradient_accumulation_steps = batch_size
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    logging_steps=10,
    # max_steps=520,
    num_train_epochs=0.016,
    save_steps=500,  # 65
    bf16=True,  # set bf16 to True with an A100
    # optim='paged_adamw_32bit',
    gradient_checkpointing=True,
    # group_by_length=True,
    # remove_unused_columns=False,
    # warmup_ratio=0.03,
    # lr_scheduler_type='constant',
    # max_grad_norm=0.3
)
current_device = accelerator.process_index
print("current_device:", current_device)
# print(type(current_device))
base_model_name = "/home/yangtong/data/llama2-hf/llama2-13b-chat_hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # bnb_4bit_quant_storage=torch.bfloat16
)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    load_in_4bit=True
)
base_model.tie_weights()
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1
def find_all_linear_names(model):
    # Collect the names of all 4-bit linear layers so LoRA can target them.
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
models = find_all_linear_names(base_model)
# print(models)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=models
)
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True
tokenizer.pad_token = tokenizer.eos_token
max_seq_length = 512
trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args
)
trainer.train()
output_dir = os.path.join(result_dir, "final_checkpoint")
trainer.model.save_pretrained(output_dir)
```
Here is my accelerate config:
```
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /home/yangtong/ft_dis/ds_config/3.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: 'c10d'
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
And here is the error:
```
Traceback (most recent call last):
File "/home/yangtong/ft_dis/ft_acc_new.py", line 58, in <module>
base_model = AutoModelForCausalLM.from_pretrained(
File "/home/yangtong/anaconda3/envs/llama2/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/home/yangtong/anaconda3/envs/llama2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
) = cls._load_pretrained_model(
File "/home/yangtong/anaconda3/envs/llama2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/home/yangtong/anaconda3/envs/llama2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 749, in _load_state_dict_into_meta_model
set_module_quantized_tensor_to_device(
File "/home/yangtong/anaconda3/envs/llama2/lib/python3.10/site-packages/transformers/integrations/bitsandbytes.py", line 108, in set_module_quantized_tensor_to_device
new_value = value.to(device)
NotImplementedError: Cannot copy out of meta tensor; no data!
```
After setting `low_cpu_mem_usage=False` in `from_pretrained`, I get a different error:
```
Traceback (most recent call last):
File "/home/yangtong/ft_dis/ft_acc_new.py", line 58, in <module>
base_model = AutoModelForCausalLM.from_pretrained(
File "/home/yangtong/anaconda3/envs/llama2/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "/home/yangtong/anaconda3/envs/llama2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3366, in from_pretrained
dispatch_model(model, **device_map_kwargs)
File "/home/yangtong/anaconda3/envs/llama2/lib/python3.10/site-packages/accelerate/big_modeling.py", line 419, in dispatch_model
attach_align_device_hook_on_blocks(
File "/home/yangtong/anaconda3/envs/llama2/lib/python3.10/site-packages/accelerate/hooks.py", line 608, in attach_align_device_hook_on_blocks
add_hook_to_module(module, hook)
File "/home/yangtong/anaconda3/envs/llama2/lib/python3.10/site-packages/accelerate/hooks.py", line 157, in add_hook_to_module
module = hook.init_hook(module)
File "/home/yangtong/anaconda3/envs/llama2/lib/python3.10/site-packages/accelerate/hooks.py", line 275, in init_hook
set_module_tensor_to_device(module, name, self.execution_device, tied_params_map=self.tied_params_map)
File "/home/yangtong/anaconda3/envs/llama2/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 354, in set_module_tensor_to_device
raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.")
ValueError: weight is on the meta device, we need a `value` to put in on 0.
```
I also tried setting `empty_init=False`, but that just fails with an error saying `LlamaForCausalLM.from_pretrained` doesn't have this parameter.
I would really appreciate it if anyone could help me solve this!