Hey everyone,
I am currently working on my master's thesis and have used the Transformers library successfully for most of the experiments I wanted to conduct. The only thing I am stuck on is loading a sharded version of Bloom-7b1 with Accelerate's load_checkpoint_and_dispatch, following the Big model inference guide, which results in the following error:
File "load_bloom_test.py", line 22, in <module>
  model = load_checkpoint_and_dispatch(
File "/opt/conda/lib/python3.8/site-packages/accelerate/big_modeling.py", line 375, in load_checkpoint_and_dispatch
  load_checkpoint_in_model(
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 699, in load_checkpoint_in_model
  set_module_tensor_to_device(model, param_name, param_device, value=param)
File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 105, in set_module_tensor_to_device
  new_module = getattr(module, split)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1260, in __getattr__
  raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'BloomForCausalLM' object has no attribute 'word_embeddings'
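In case it helps, this is roughly how I compared the parameter names stored in the checkpoint shards with the parameter names the instantiated model exposes (just a quick sketch, assuming the standard pytorch_model.bin.index.json layout that save_pretrained produces for sharded checkpoints):

import json
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights

model_name = "models/bloom-7b1"

# Parameter names recorded in the sharded checkpoint index
# (assuming the usual pytorch_model.bin.index.json layout)
with open(f"{model_name}/pytorch_model.bin.index.json") as f:
    checkpoint_keys = sorted(json.load(f)["weight_map"])

# Parameter names of the model instantiated from the config on the meta device
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)
model_keys = sorted(name for name, _ in model.named_parameters())

print("first checkpoint key:", checkpoint_keys[0])
print("first model key:     ", model_keys[0])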
The code I am using works fine for other models, such as Galactica (for all variants, including galactica-120b) and looks like this:
import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

# Loading model from config on meta device and using sharded checkpoints
model_name = "models/bloom-7b1"
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Calculating device map
device_map = infer_auto_device_map(
    model,
    no_split_module_classes=["BloomBlock"],
    dtype=torch.float16,
)

# Loading the checkpoint according to the auto device_map
model = load_checkpoint_and_dispatch(
    model,
    model_name,
    device_map=device_map,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Hello, my name is", return_tensors="pt")
inputs = inputs.to(0)
output = model.generate(inputs["input_ids"])
print(tokenizer.decode(output[0]))
I have re-downloaded the model a couple of times and also tried passing bigscience/bloom-7b1 so that it is downloaded directly from the Hub while the script runs.
I would like to get this approach working, since it has worked fine for every other model, and I have not managed to get a distributed setup running any other way once a model no longer fits on a single GPU.
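For reference, this is roughly the simpler loading path I would switch to if it can be made to work across multiple GPUs (just a sketch; I am assuming device_map="auto" in from_pretrained distributes the sharded checkpoint the same way as the manual infer_auto_device_map route):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "models/bloom-7b1"

# Let from_pretrained handle sharded loading and device placement in one call
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)
output = model.generate(inputs["input_ids"])
print(tokenizer.decode(output[0]))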
Any help, whether with a different loading approach or a solution to this error, would be greatly appreciated.