RuntimeError with Mixed Precision During LoRA Fine-Tuning of LLaVA on a Small GPU

Hi everyone,

I'm facing an issue while fine-tuning the LLaVA model with LoRA on a machine with limited GPU resources. To fit the small GPU, I've been experimenting with 4-bit precision. However, I consistently hit the following error:

RuntimeError: expected scalar type BFloat16 but found Float
The error is raised inside the vision model, specifically during the LayerNorm operation in its forward pass.

Key Configuration:

  • Model: liuhaotian/llava-v1.6-vicuna-7b
  • Vision Tower: openai/clip-vit-large-patch14-336
  • LoRA: Enabled with lora_r=128, lora_alpha=256
  • Precision: 4-bit (bits=4)
  • Other Settings: bf16=True, gradient_checkpointing=True
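
For context, here is a minimal sketch of how a roughly equivalent setup looks when wired up directly with transformers/PEFT (the LLaVA training scripts assemble this differently; the dropout and target_modules values below are illustrative, not from my run):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization with bfloat16 compute, roughly matching bits=4 + bf16=True
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype should match bf16 training
)

# LoRA settings matching lora_r=128, lora_alpha=256
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,                                        # assumption: not stated above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: typical targets
    task_type="CAUSAL_LM",
)
```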

Problem:

I'm running into a data type mismatch: some layers (e.g., LayerNorm) expect BFloat16 but receive Float32, which triggers the error. When I inspect the model, I find a mix of data types across the layers:

  • 166 layers in float32
  • 744 layers in bfloat16
  • 369 layers in uint8
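
Those counts came from a quick dtype inspection; a snippet along these lines (assuming `model` is the loaded LLaVA model) reproduces that kind of breakdown:

```python
from collections import Counter

# Tally parameter tensors by storage dtype to see what ended up in
# float32, bfloat16, or quantized uint8 storage.
dtype_counts = Counter(p.dtype for p in model.parameters())
for dtype, count in dtype_counts.items():
    print(f"{count} parameter tensors in {dtype}")
```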

My Situation:

I'm trying to modify LLaVA for my own use case and need to run it in a "debug mode" to test and tweak the code. Since I have limited GPU resources, I'm using low precision (4-bit) to make debugging feasible. However, this data type mismatch is proving to be a roadblock.

My Questions:

  • How can I debug or fine-tune LLAVA with LoRA on a small GPU without running into these precision-related errors?
  • Should I be manually converting specific layers to avoid the mismatch between bfloat16 and float32? (See the sketch after this list for what I mean.)
  • Is there a general approach to running LoRA fine-tuning in a lightweight "debug mode" for code experimentation, where output quality and precision mismatches don't matter?
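
To clarify the second question, by "manually converting specific layers" I mean something like the sketch below, which would force every remaining non-quantized float32 tensor (LayerNorm weights included) to bfloat16 so the vision tower sees a single dtype. I'm not sure whether this is numerically safe:

```python
import torch

def cast_float32_to_bf16(model, dtype=torch.bfloat16):
    """Downcast non-quantized float32 tensors (e.g. LayerNorm weights) to `dtype`.

    Note: I believe peft's prepare_model_for_kbit_training goes the other way
    (upcasting norms to float32 for stability), so this downcast may hurt
    training quality even if it removes the dtype-mismatch error.
    """
    for module in model.modules():
        if isinstance(module, torch.nn.LayerNorm):
            module.to(dtype)
    for param in model.parameters():
        if param.dtype == torch.float32:
            param.data = param.data.to(dtype)
    return model
```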

Any guidance or suggestions would be greatly appreciated!

Thanks in advance!


It seems like torch's autocast is doing something odd; that, or a CUDA version mismatch, is the most common cause.
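
One quick way to check whether autocast is involved (just a sketch; `model` and `batch` are placeholders for whatever your training step uses) is to run the forward pass under an explicit bfloat16 autocast context and see whether the LayerNorm error goes away:

```python
import torch

# Under CUDA autocast, ops such as layer_norm get a consistent compute dtype
# inserted automatically, so a float32 LayerNorm weight and a bfloat16
# activation no longer collide inside the vision tower.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    outputs = model(**batch)
    loss = outputs.loss
```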