I'm running Ubuntu on:
Intel® Core™ i7-8565U CPU @ 1.80GHz × 8
16 GB RAM
GeForce MX150 GPU
I have a model that I've already tried in Google Colab. I know Colab provides a much better environment than my PC, so when loading the model on my PC I use 4-bit quantization. That should make it much lighter to load, and my PC doesn't seem to lag while loading it.
My code is as follows:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain.llms import HuggingFacePipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(
    "Yellow-AI-NLP/komodo-7b-base", cache_dir="./huggingface_cache/", trust_remote_code=True
)

# Load in 4-bit and let accelerate place layers across GPU and CPU.
model = AutoModelForCausalLM.from_pretrained(
    "Yellow-AI-NLP/komodo-7b-base",
    cache_dir="./huggingface_cache/",
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
    device_map="auto",
    trust_remote_code=True,
)
# Note: no model.to(device) here -- .to() is not supported for 4-bit
# bitsandbytes models, and device_map="auto" already handles placement.

pipe = pipeline(  # renamed so it doesn't shadow transformers.pipeline
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1280,
)
local_llm = HuggingFacePipeline(pipeline=pipe)
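In case the deprecated kwargs matter: as far as I understand, passing load_in_4bit directly to from_pretrained is deprecated in newer transformers releases in favor of BitsAndBytesConfig. A minimal sketch of the equivalent setup I could switch to (the max_memory caps are placeholder guesses for my hardware, not values I've verified):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same 4-bit settings as above, expressed via BitsAndBytesConfig.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Yellow-AI-NLP/komodo-7b-base",
    cache_dir="./huggingface_cache/",
    quantization_config=bnb_config,
    device_map="auto",
    # Placeholder caps so accelerate doesn't overcommit GPU/CPU memory;
    # the exact numbers are guesses for my machine.
    max_memory={0: "1.5GiB", "cpu": "10GiB"},
    trust_remote_code=True,
)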
It seems normal at first. Everything runs smoothly until the checkpoint shards start loading:
Loading checkpoint shards: 0%| | 0/6
Loading runs pretty smoothly at first, but then VS Code suddenly closed. When I tried again, my PC showed a black screen, and after a while it seemed to have restarted. I don't know what's going on; no error shows up in the VS Code logs or anywhere else. Does anyone know why this happens, or how I can debug it?
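My current guess is that I'm running out of RAM while the shards load and the machine goes down before Python can print an error, but I don't know how to confirm that. A minimal sketch of what I'm thinking of running in a second terminal during loading (assuming psutil is installed; the 2-second interval is arbitrary):

import time
import psutil

# Print RAM and swap usage every 2 seconds while the model loads in the
# other terminal; a spike to ~100% right before the crash would point
# to an out-of-memory problem.
while True:
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    print(f"RAM {mem.percent:5.1f}%  swap {swap.percent:5.1f}%", flush=True)
    time.sleep(2)

After the reboot I could also check journalctl -b -1 for OOM-killer or GPU driver messages, if that's the right place to look.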