Hello - referring to this example from https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one:
import bitsandbytes as bnb
from torch import nn
from transformers.trainer_pt_utils import get_parameter_names
training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
decay_parameters = get_parameter_names(model, [nn.LayerNorm])
decay_parameters = [name for name in decay_parameters if "bias" not in name]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if n in decay_parameters],
        "weight_decay": training_args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if n not in decay_parameters],
        "weight_decay": 0.0,
    },
]

optimizer_kwargs = {
    "betas": (training_args.adam_beta1, training_args.adam_beta2),
    "eps": training_args.adam_epsilon,
}
optimizer_kwargs["lr"] = training_args.learning_rate
adam_bnb_optim = bnb.optim.Adam8bit(
    optimizer_grouped_parameters,
    betas=(training_args.adam_beta1, training_args.adam_beta2),
    eps=training_args.adam_epsilon,
    lr=training_args.learning_rate,
)
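For context, the docs then pass this hand-built optimizer to the Trainer via the optimizers argument; a minimal sketch of that wiring (model and train_dataset are placeholders, not part of the docs snippet):

from transformers import Trainer

trainer = Trainer(
    model=model,                        # placeholder: an already-loaded pretrained model
    args=training_args,
    train_dataset=train_dataset,        # placeholder dataset
    optimizers=(adam_bnb_optim, None),  # None -> Trainer creates its default LR scheduler
)
trainer.train()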
Is the above code still needed with the latest version of transformers (4.31, installed from main)? Can't we just do:
args = transformers.TrainingArguments(
    ...
    optim='paged_adamw_8bit',
)
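i.e., something like this sketch (output_dir and batch size are placeholders; I'm assuming 'paged_adamw_8bit' is an accepted value for TrainingArguments.optim in 4.31):

import transformers

args = transformers.TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    optim="paged_adamw_8bit",  # or "adamw_bnb_8bit" for the non-paged 8-bit AdamW
)
trainer = transformers.Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()  # no hand-built bnb.optim.Adam8bit needed, if I understand correctly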
And do I, as a user, still need to do something (what exactly? an example would be nice) with respect to this note in the docs:
"Note that in order to use the 8-bit optimizer with an existing pretrained model a change to the embedding layer is needed. Read this issue for more information."
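From what I gather, the change the note refers to is keeping 32-bit optimizer state for the embedding weights (or swapping in bnb.nn.StableEmbedding). A rough sketch of the override approach, assuming model is the pretrained model and that this has to happen before the 8-bit optimizer is created - is this still required, or does the Trainer handle it when optim='paged_adamw_8bit' is set?

import bitsandbytes as bnb
import torch

# Keep 32-bit Adam state for all embedding layers; everything else stays 8-bit.
manager = bnb.optim.GlobalOptimManager.get_instance()
for module in model.modules():
    if isinstance(module, torch.nn.Embedding):
        manager.register_module_override(module, "weight", {"optim_bits": 32})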