Flan-UL2: must I convert the model in order to save it?

Hi,

I got a Colab notebook up and running with the Flan-UL2 example verbatim from the “Running the model” section of the model card. The model runs successfully and produces the expected output.
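For reference, this is roughly the snippet I ran (paraphrased from memory; the exact code is in the model card, but the 8-bit load and the automatic device map are the relevant parts):

# Approximately the model card's "Running the model" example (paraphrased, not verbatim)
from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("google/flan-ul2", device_map="auto", load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

# example prompt, not the exact one from the card
inputs = tokenizer("Answer the following question step by step: how many legs do three cats have?", return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(inputs, max_length=200)
print(tokenizer.decode(outputs[0]))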

However, I noticed that during the model = T5ForConditionalGeneration.from_pretrained(...) call, the notebook runtime downloaded 8 shards of ~5 GB each. Being on Colab, I wanted to save the model to Google Drive so it persists between sessions. Naively, I ran

model.save_pretrained("/content/drive/MyDrive/Colab Notebooks/data/ul2-hf-model", from_pt=True)    

and after a runtime restart:

model = T5ForConditionalGeneration.from_pretrained("/content/drive/MyDrive/Colab Notebooks/data/ul2-hf-model", device_map="auto")   

The model seems to load into memory as 4 checkpoint shards and also seems to process my input string. However, the result for any input string is <pad></s>, so I guess there is some basic flaw in my naive model-saving attempt.

In the model card, I notice there is a “Converting from T5x to HF” section with a link to a conversion script. Do I need to run that script if I want to save the model as an HF/PyTorch model?

EDIT3:
If helpful, this is the output of print(model.hf_device_map):

{'shared': 0, 'lm_head': 0, 'encoder': 0, 'decoder.embed_tokens': 0, 'decoder.block.0': 0, 'decoder.block.1': 0, 'decoder.block.2': 0, 'decoder.block.3': 'cpu', 'decoder.block.4': 'cpu', 'decoder.block.5': 'cpu', 'decoder.block.6': 'cpu', 'decoder.block.7': 'cpu', 'decoder.block.8': 'cpu', 'decoder.block.9': 'cpu', 'decoder.block.10': 'cpu', 'decoder.block.11': 'cpu', 'decoder.block.12': 'cpu', 'decoder.block.13': 'cpu', 'decoder.block.14': 'cpu', 'decoder.block.15': 'cpu', 'decoder.block.16': 'cpu', 'decoder.block.17': 'cpu', 'decoder.block.18': 'cpu', 'decoder.block.19': 'cpu', 'decoder.block.20': 'cpu', 'decoder.block.21': 'cpu', 'decoder.block.22': 'cpu', 'decoder.block.23': 'cpu', 'decoder.block.24': 'cpu', 'decoder.block.25': 'cpu', 'decoder.block.26': 'cpu', 'decoder.block.27': 'cpu', 'decoder.block.28': 'cpu', 'decoder.block.29': 'cpu', 'decoder.block.30': 'cpu', 'decoder.block.31': 'cpu', 'decoder.final_layer_norm': 'cpu', 'decoder.dropout': 'cpu'}

I don’t know where I got that parameter from, but from_pt=True is an argument for from_pretrained (for loading from a PyTorch checkpoint), so passing it to save_pretrained probably does not make much sense. On another save attempt, I noticed the warning

UserWarning: You are calling `save_pretrained` to a 8-bit converted model you may likely encounter unexepected behaviors.

So I guess therein lies the culprit: what I saved was the 8-bit quantized model, not the original weights. After downloading the weights anew, inference produces the expected results, and calling

print(model.hf_device_map)

now shows

{'': 0}

This confirms that the model works after passing torch_dtype=torch.bfloat16 to the from_pretrained call; the {'': 0} device map means the whole model now sits on GPU 0. I am using a Colab Pro notebook with 40 GB of GPU memory.
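For anyone else hitting this, here is a rough sketch of the approach that ended up working for me: load the original weights in bfloat16 rather than 8-bit, save without any from_pt argument, and reload from Drive the same way (a sketch only, not a verbatim copy of my notebook):

import torch
from transformers import T5ForConditionalGeneration, AutoTokenizer

save_dir = "/content/drive/MyDrive/Colab Notebooks/data/ul2-hf-model"

# Load the original (non-quantized) weights in bfloat16; device_map="auto" places them on the GPU
model = T5ForConditionalGeneration.from_pretrained("google/flan-ul2", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")

# Save model and tokenizer to Drive (note: no from_pt argument)
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# After a runtime restart, load them back from Drive instead of re-downloading
model = T5ForConditionalGeneration.from_pretrained(save_dir, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(save_dir)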