How to use PEFT + base merged models in offline mode?

I have a model that I've remerged and pushed to my own repo at alvations/ALMA-7B-R-remerged · Hugging Face

When I do this with an internet connection, it works as expected:

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Load the remerged model (base + LoRA weights already merged)
model = AutoModelForCausalLM.from_pretrained("alvations/ALMA-7B-R-remerged", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("alvations/ALMA-7B-R-remerged", padding_side='left')

But when I do the same in offline mode, it gives a safetensors resolving error:

HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 \
python -c 'import torch; from transformers import AutoModelForCausalLM; model2 = AutoModelForCausalLM.from_pretrained("alvations/ALMA-7B-R-remerged", torch_dtype=torch.float16, device_map="auto", cache_dir="./mynewdir", local_files_only=True); print(model2)'

[out]:

Cannot reach https:/...remerged/resolve/main/adapter_model.safetensors: offline mode is enabled...

Q1: How do I load the merged model that I’ve pushed to the hub in offline mode?

Q2: Are my tokenizer, base model, and adapter model files in the directory structure expected by huggingface_hub?

Q3: Why is the safetensors file path/name being resolved against the Hub when I load a model? Is there any way to avoid that?


More details

Given a model that has been previously merged:

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Load base model and LoRA weights
model = AutoModelForCausalLM.from_pretrained("haoranxu/ALMA-7B-R", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("haoranxu/ALMA-7B-R", padding_side='left')

If we save the base model and adapter weights, and also the tokenizer, into the same directory:

model.save_pretrained("alvations/ALMA-7B-R")
model.base_model.save_pretrained("alvations/ALMA-7B-R")
tokenizer.save_pretrained("alvations/ALMA-7B-R")

We seem to be able to load the model as such:

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model2 = AutoModelForCausalLM.from_pretrained(
  "alvations/ALMA-7B-R", 
  torch_dtype=torch.float16, 
  device_map="auto", local_files_only=True)

Even though the config file shows:

$ cat alvations/ALMA-7B-R/config.json
{
  "_name_or_path": "haoranxu/ALMA-7B-R",
  "architectures": [
    "LlamaForCausalLM"
  ],
...
}

It works with local_files_only=True as long as offline mode isn’t enabled.

And when we try to load it with offline mode turned on in Colab, it kinda works too, but I’m unsure whether Colab really honors HF_HUB_OFFLINE.

e.g.

HF_HUB_OFFLINE=1 TRANSFORMER_OFFLINE=1 python -c 'import torch; from transformers import AutoModelForCausalLM; model2 = AutoModelForCausalLM.from_pretrained("alvations/ALMA-7B-R", torch_dtype=torch.float16, device_map="auto", local_files_only=True, cache_dir="."); print(model2)'

[out]:

Loading checkpoint shards: 100% 3/3 [00:07<00:00,  2.50s/it]
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
     ...
            (lora_A): ModuleDict(
              (default): Linear(in_features=11008, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=4096, bias=False)
            )
...
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)

But if we push to hub:

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model2 = AutoModelForCausalLM.from_pretrained(
  "alvations/ALMA-7B-R", 
  torch_dtype=torch.float16, 
  device_map="auto", local_files_only=True)

tokenizer = AutoTokenizer.from_pretrained(
  "alvations/ALMA-7B-R", 
  local_files_only=True)

model2.base_model.push_to_hub("alvations/ALMA-7B-R-remerged")
tokenizer.push_to_hub("alvations/ALMA-7B-R-remerged")

Then do a snapshot_download from the Hub:

from huggingface_hub import snapshot_download
snapshot_download("alvations/ALMA-7B-R-remerged", cache_dir="./mynewdir")

And do this locally:

HF_HUB_OFFLINE=1 TRANSFORMER_OFFLINE=1 python -c 'import torch; from transformers import AutoModelForCausalLM; model2 = AutoModelForCausalLM.from_pretrained("alvations/ALMA-7B-R", torch_dtype=torch.float16, device_map="auto", cache_dir="./mynewdir", local_files_only=True); print(model2)'

It gives an error resolving the safetensors file:

Cannot reach https:/...remerged/resolve/main/adapter_model.safetensors: offline mode is enabled...
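
My guess (not verified) is that an adapter_config.json ended up in the remerged repo alongside the full weights, which sends from_pretrained down the PEFT adapter path and makes it try to fetch adapter_model.safetensors from the Hub. A quick way to check what snapshot_download actually pulled into ./mynewdir:

import os

# Walk the snapshot cache and flag any leftover PEFT adapter files.
# The file names below are the standard PEFT ones; whether they are
# actually present is exactly what this check is for.
adapter_files = {"adapter_config.json", "adapter_model.safetensors", "adapter_model.bin"}
for root, _, files in os.walk("./mynewdir"):
    for name in files:
        marker = "  <-- adapter file" if name in adapter_files else ""
        print(os.path.join(root, name) + marker)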

Possibly I’m not saving the right thing in the first place; when I load the saved model from the local directory, the output gets wonky!

Step 1: Save the tokenizer, PEFT, and base model files into a single local directory

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Load base model and LoRA weights
model = AutoModelForCausalLM.from_pretrained("haoranxu/ALMA-7B-R", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("haoranxu/ALMA-7B-R", padding_side='left')

# Add the source sentence into the prompt template
prompt="Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)


model.save_pretrained("alvations/ALMA-7B-R")
tokenizer.save_pretrained("alvations/ALMA-7B-R")

# Probably this is where it gets wonky since the base_model safetensor files overwrite the tuned model.
model.base_model.save_pretrained("alvations/ALMA-7B-R")

import os
os._exit(00)

[out]:

['Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish: I love machine translation.']

Step 2: After loading from the local directory, the output gets all wonky:

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "alvations/ALMA-7B-R",
    local_files_only=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "alvations/ALMA-7B-R",
    local_files_only=True
)


# Add the source sentence into the prompt template
prompt="Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)

[out]:

['Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:thm interessbirпар Never oil Fulogether Fulickedogetherogethericked Never Klosterogetherogetherogetherickedogether']

Then I tried skipping the model.base_model.save_pretrained() call.

Step 0: Remove the previously saved local model directory

! rm -rf "alvations/ALMA-7B-R"

Step 1: Save the model + tokenizer without the model.base_model

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Load base model and LoRA weights
model = AutoModelForCausalLM.from_pretrained("haoranxu/ALMA-7B-R", torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("haoranxu/ALMA-7B-R", padding_side='left')

# Add the source sentence into the prompt template
prompt="Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)

model.save_pretrained("alvations/ALMA-7B-R")
tokenizer.save_pretrained("alvations/ALMA-7B-R")

import os
os._exit(00)

[out]:

['Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish: I love machine translation.']

Step 2: Load the model + tokenizer from the local directory

It sort of works alright now.

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "alvations/ALMA-7B-R",
    local_files_only=True,
    torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "alvations/ALMA-7B-R",
    local_files_only=True
)


# Add the source sentence into the prompt template
prompt="Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)

import os
os._exit(00)

[out]:

['Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish: I love machine translation.']

Step 3a: Make sure HF_HUB_OFFLINE=1 works and throws an error for a model not found locally

! HF_HUB_OFFLINE=1 python -c 'from transformers import AutoTokenizer; tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")'

[out]:


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1238, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1631, in get_hf_file_metadata
    r = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 385, in _request_wrapper
    response = _request_wrapper(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 408, in _request_wrapper
    response = get_session().request(method=method, url=url, **params)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_http.py", line 78, in send
    raise OfflineModeIsEnabled(
huggingface_hub.utils._http.OfflineModeIsEnabled: Cannot reach https://huggingface.co/facebook/nllb-200-distilled-600M/resolve/main/config.json: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 118, in _inner_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py", line 1371, in hf_hub_download
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 782, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1111, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 633, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 688, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 441, in cached_file
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like facebook/nllb-200-distilled-600M is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
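
As an aside, the same offline switches can also be flipped from inside Python, as long as that happens before transformers / huggingface_hub are imported (they read the environment variables at import time). A minimal sketch reproducing the check above:

import os

# Must be set before importing transformers/huggingface_hub to take effect.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import AutoTokenizer

# With no local copy of this repo, this should raise the same
# OfflineModeIsEnabled / OSError chain as the command-line test above.
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")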

Step 3b: Run Step 2 with HF_HUB_OFFLINE=1

Offline mode seems to work with the local model directory saved using .save_pretrained(...):

%%writefile test.py

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "alvations/ALMA-7B-R",
    local_files_only=True,
    torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "alvations/ALMA-7B-R",
    local_files_only=True
)

# Add the source sentence into the prompt template
prompt="Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)

Then:

! HF_HUB_OFFLINE=1 python test.py

[out]:

['Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish: I love machine translation.']

Step 4a: Now let’s push that model to the Hugging Face Hub

from huggingface_hub import notebook_login
notebook_login()

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "alvations/ALMA-7B-R",
    local_files_only=True,
    torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "alvations/ALMA-7B-R",
    local_files_only=True
)

model.push_to_hub("alvations/ALMA-7B-R-remerged")
tokenizer.push_to_hub("alvations/ALMA-7B-R-remerged")

import os
os._exit(00)

Step 4b: Redownload the model from HF hub to a local directory using snapshot_download()

from huggingface_hub import snapshot_download
snapshot_download("alvations/ALMA-7B-R-remerged", cache_dir="mynewcachedir")

Step 4c: (With Internet) Reload the model not from the local directory but from the cache_dir

! rm -rf alvations/*

Then:

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "alvations/ALMA-7B-R-remerged",
    local_files_only=True,
    cache_dir="mynewcachedir",
    torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "alvations/ALMA-7B-R-remerged",
    cache_dir="mynewcachedir",
    local_files_only=True
)


# Add the source sentence into the prompt template
prompt="Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)

import os
os._exit(00)

[out]:

['Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish: I love machine translation.']

Step 5: (Without Internet) Reload the model not from the local directory but from the cache_dir with HF_HUB_OFFLINE=1

%%writefile test2.py

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "alvations/ALMA-7B-R-remerged",
    local_files_only=True,
    cache_dir="mynewcachedir",
    torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "alvations/ALMA-7B-R-remerged",
    cache_dir="mynewcachedir",
    local_files_only=True
)

# Add the source sentence into the prompt template
prompt="Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)

Then:

! HF_HUB_OFFLINE=1 python test2.py

[out]:

['Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish: I love machine translation.']

Then we try:

! HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 python test2.py

[out]:

['Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish: I love machine translation.']

Q: Then why did we initially get errors from Hugging Face trying to reach the Hub to resolve the safetensors file?

A: Maybe different transformers and tokenizers versions?

FYI, all of the above in this comment is from Colab with this pip freeze:

transformers==4.38.2
tokenizers==0.15.2
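
(To double-check the versions from inside the running kernel rather than via pip freeze:)

import tokenizers
import transformers

# Print the versions actually loaded in this runtime.
print(transformers.__version__, tokenizers.__version__)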

TL;DR of all the replies above…

Q1: How do I load the merged model that I’ve pushed to the hub in offline mode?

See Step 4 + Step 5 above; they work with HF_HUB_OFFLINE=1 (w/o internet) too:

from huggingface_hub import snapshot_download
snapshot_download("alvations/ALMA-7B-R-remerged", cache_dir="mynewcachedir")

Then:

import torch
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "alvations/ALMA-7B-R-remerged",
    local_files_only=True,
    cache_dir="mynewcachedir",
    torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(
    "alvations/ALMA-7B-R-remerged",
    cache_dir="mynewcachedir",
    local_files_only=True
)

# Add the source sentence into the prompt template
prompt="Translate this from Chinese to English:\nChinese: 我爱机器翻译。\nEnglish:"
input_ids = tokenizer(prompt, return_tensors="pt", padding=True, max_length=40, truncation=True).input_ids.cuda()

# Translation
with torch.no_grad():
    generated_ids = model.generate(input_ids=input_ids, num_beams=5, max_new_tokens=20, do_sample=True, temperature=0.6, top_p=0.9)
outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(outputs)
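
An alternative I haven’t exercised as thoroughly, but which sidesteps the repo-id-to-cache resolution entirely, is snapshot_download with local_dir, so you end up with a plain folder and pass that path to from_pretrained directly (a sketch; the local_dir name is arbitrary):

import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

# Materialize the repo into a plain directory instead of the hub cache layout.
local_path = snapshot_download("alvations/ALMA-7B-R-remerged", local_dir="ALMA-7B-R-remerged-local")

# From here on only a filesystem path is involved, so nothing should try to reach the Hub.
model = AutoModelForCausalLM.from_pretrained(local_path, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(local_path, padding_side="left")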

Q2: Are my tokenizer, base model, and adapter model files in the directory structure expected by huggingface_hub?

If the model is already pre-merged like haoranxu/ALMA-7B-R · Hugging Face, there’s no need to call model.base_model.push_to_hub / model.base_model.save_pretrained.

Just these are enough (see the sanity-check sketch after the list):

  • model.save_pretrained() and
  • tokenizer.save_pretrained()
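
As a rough sanity check (the exact shard names depend on the model), the saved directory should only contain the config, the sharded *.safetensors weights plus their index, and the tokenizer files, with nothing named adapter_*:

import os

# After model.save_pretrained(...) and tokenizer.save_pretrained(...) this should
# list config.json, generation_config.json, the model-*.safetensors shards,
# model.safetensors.index.json, and the tokenizer files -- no adapter_* files.
print(sorted(os.listdir("alvations/ALMA-7B-R")))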

Q3: Why is the safetensors file path/name being resolved against the Hub when I load a model? Is there any way to avoid that?

This one I have no real answer for. After upgrading transformers and tokenizers to the following versions, the safetensors errors no longer surfaced:

transformers==4.38.2
tokenizers==0.15.2

Maybe it’s related to these? The load_lora_weights is not working offline · Issue #6110 · huggingface/diffusers · GitHub, and this search: safetensors offline mode hf_hub_offline=1 site:github.com - Google Search