Why does adamw_bnb_8bit skip updating embedding parameters?

I'm using Whisper for testing, but this probably applies to other models as well, since there is nothing in Whisper's source code that specifically asks for this behavior.

When using adamw_bnb_8bit as the optimizer for seq2seq tasks, I noticed that it automatically turns off the embedding parameters (at least according to the output below, which is printed right at the start of training).

Is this a quirk/property of the bitsandbytes package? I haven't read Dettmers et al.'s work, so I'm not sure whether this is expected behavior.

@sanchit-gandhi any thoughts?

(for whisper-small)

skipped Embedding(1500, 768): 1.0986328125M params
skipped Embedding(51865, 768, padding_idx=50257): 39.085693359375M params
skipped WhisperPositionalEmbedding(448, 768): 39.413818359375M params
skipped: 39.413818359375M params
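For context, the setup is nothing exotic; roughly the following is enough to trigger the message (everything except optim is a placeholder):

```python
from transformers import Seq2SeqTrainingArguments

# placeholder hyperparameters; the "skipped ..." lines appear as soon as the
# Trainer builds the optimizer, i.e. right when training starts
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-finetuned",
    optim="adamw_bnb_8bit",              # selects bitsandbytes' 8-bit AdamW
    per_device_train_batch_size=16,
    learning_rate=1e-5,
)
```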

Hey @ozanciga! Very interesting observation! My understanding is that embedding layers can become particularly unstable when downcast to lower precision (fp16/8-bit). It might be that there are exception clauses for embedding modules in the bitsandbytes package that prevent them from being downcast.

Indeed! It looks like we need a special embedding layer for 8bit embeddings: GitHub - TimDettmers/bitsandbytes: 8-bit CUDA functions for PyTorch

See step 3: Replace embedding layer if necessary: torch.nn.Embedding(..) -> bnb.nn.Embedding(..)
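In code, that step amounts to roughly this (a sketch; the dimensions are just Whisper-small's for illustration, and if I understand the library correctly bnb.nn.Embedding keeps its optimizer state in 32-bit even under the 8-bit optimizer):

```python
import bitsandbytes as bnb
import torch

# before: a standard embedding layer, optimized by a standard Adam
# emb = torch.nn.Embedding(51865, 768, padding_idx=50257)

# after: the bitsandbytes embedding plus the 8-bit Adam optimizer
emb = bnb.nn.Embedding(51865, 768, padding_idx=50257)
optimizer = bnb.optim.Adam8bit(emb.parameters(), lr=1e-5)
```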

Thank you @sanchit-gandhi, that seems about right. Maybe a question for the Hugging Face devs: should this be handled on the backend automatically?

Also, do you have any opinion on whether freezing the embeddings has any significant impact on the outcome? I'm asking specifically about Whisper, but I'm also interested in general, since using 8-bit is very desirable in most cases.

Actually, I went through the source and it turns out "skipped" means the optimizer is using 32 bits for those parameters. I monkey-patched the code to use bnb.nn.StableEmbedding instead, but I doubt it's worth it in most cases.
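For reference, the patch boils down to something like this (just a sketch, not my exact code; the attribute names assume the transformers Whisper implementation):

```python
import bitsandbytes as bnb
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# swap the decoder token embedding for bitsandbytes' StableEmbedding
# (same shape, plus a LayerNorm and 32-bit optimizer state)
old = model.model.decoder.embed_tokens
new = bnb.nn.StableEmbedding(
    old.num_embeddings, old.embedding_dim, padding_idx=old.padding_idx
)
with torch.no_grad():
    new.weight.copy_(old.weight)   # keep the pretrained weights
model.model.decoder.embed_tokens = new
```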

Hey @ozanciga! Sorry for the late reply here. Indeed, it could be worth handling this automatically in the Trainer. Feel free to open an issue on the transformers repo if you want to discuss how this might look, or open a PR directly if you already have an idea of how to fix it. I think this would make for a nice PR to the repo! Happy to help with any questions around the issue/PR!

My intuition would be that freezing the embeddings won't have a significant impact on the outcome. I believe that in the Dalle-Mini project, freezing the pre-trained embeddings actually gave superior performance vs. non-frozen. It's probably best to experiment here with, say, 1 epoch of data and see how eval WER compares with frozen vs. non-frozen embeddings.
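If it helps, freezing them for that comparison can be as simple as something like this (a sketch; the attribute names assume the transformers Whisper model):

```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# disable gradients for the token and positional embeddings so the optimizer
# leaves them untouched
embedding_modules = (
    model.model.encoder.embed_positions,
    model.model.decoder.embed_tokens,
    model.model.decoder.embed_positions,
)
for module in embedding_modules:
    for param in module.parameters():
        param.requires_grad = False
```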

Now I guess this is handled automatically if the Trainer is correctly initialized on the main branch?
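From what I can tell, what it does is roughly the following (a paraphrase, not the actual source): it registers a 32-bit override for every embedding module before creating the 8-bit optimizer, which is what the "skipped" lines report.

```python
import bitsandbytes as bnb
import torch.nn as nn

def keep_embeddings_in_32bit(model: nn.Module) -> None:
    # ask bitsandbytes to keep 32-bit optimizer state for embedding weights;
    # this must run before the 8-bit optimizer is constructed
    manager = bnb.optim.GlobalOptimManager.get_instance()
    skipped = 0
    for module in model.modules():
        if isinstance(module, nn.Embedding):
            skipped += sum(p.numel() for p in module.parameters())
            manager.register_module_override(module, "weight", {"optim_bits": 32})
            print(f"skipped {module}: {skipped / 2**20}M params")
    print(f"skipped: {skipped / 2**20}M params")
```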