What is the difference?
Which method is better?
pipeline = transformers.pipeline(
    "text-generation", model=model_id, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto"
)
and
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device_map="auto",
    use_triton=False,
    quantize_config=None,
)
return model, tokenizer
def load_full_model(model_id, model_basename, device_type, logging):
    """
    Load a full model using either LlamaTokenizer or AutoModelForCausalLM.

    This function loads a full model based on the specified device type.
    If the device type is 'mps' or 'cpu', it uses LlamaTokenizer and LlamaForCausalLM.
    Otherwise, it uses AutoModelForCausalLM.

    Parameters:
    - model_id (str): The identifier for the model on HuggingFace Hub.
    - model_basename (str): The base name of the model file.
    - device_type (str): The device to load the model on ('cuda', 'mps' or 'cpu').
    - logging: Logger used to report progress.
    """
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        # quantization_config=quantization_config,
        # low_cpu_mem_usage=True,
        # torch_dtype="auto",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        cache_dir="./models/",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir="./models/")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=MAX_NEW_TOKENS,
    temperature=0.2,
    # top_p=0.95,
    repetition_penalty=1.15,
    generation_config=generation_config,
)
local_llm = HuggingFacePipeline(pipeline=pipe)
Hi @alice86,
It’s a question of the degree of abstraction; there is no good or bad, it’s about convenience.
pipeline() takes care of some of the details under the hood for you, so it’s better to stick with the simple one until you reach its limits.
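For illustration, here is a minimal sketch of the two approaches side by side; the checkpoint name is only an example, and the pipe_simple / pipe_explicit names are made up for this sketch. The explicit version just spells out the steps the one-liner does for you (in practice you would pick one, not load the model twice):

import torch
import transformers

model_id = "meta-llama/Meta-Llama-3-8B"  # example checkpoint; any causal LM works

# High-level: pipeline() loads the model and tokenizer for you.
pipe_simple = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Lower-level: load the model and tokenizer yourself, then hand them to pipeline().
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
pipe_explicit = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)

# Both objects are used the same way afterwards.
print(pipe_simple("Hello, my name is", max_new_tokens=20)[0]["generated_text"])

The explicit form is only worth the extra lines when you need control the one-liner doesn’t expose, e.g. a quantization config, a non-default tokenizer, or wrapping the pipeline in HuggingFacePipeline as in your second snippet.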
I already downloaded the model and use the local path directly.
model_id = "./Llama3"
pipeline = transformers.pipeline(
"text-generation", model=model_id, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto"
)
It shows this error:
ValueError: You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead.
Q1. What does “You are trying to offload the whole model to the disk” mean?
If I use model_id = "meta-llama/Meta-Llama-3-8B", it automatically downloads the folder models--meta-llama--Meta-Llama-3-8B into ./cache.
So is downloading the model also a case of offloading the whole model to the disk?
Q2. Since the error is about offloading the whole model, what would the code be to load the model online (without offloading it to disk)?