Distributed Inference with 🤗 Accelerate - Compare Baseline vs Fine Tuned Model

Hello Team,

I have a use where I need to compare LLM outputs of fine tuned model with a baseline model. In this case the baseline is codellama and a fine tuned test model. And I also need to run it on hundreds of prompts. I do have access to 4 gpus GCP box.

Initially I tried to load the evaluate dataset in pandas dataframe and used the apply method per row to apply model generate. I also used device_map auto with each model so that the models can get distributed over multiple gpus.

However, I believe we can scale up better with accelerate and spawn multiple processes to do the same instead of sequentially processing each row with pandas. I read the Distributed Inference with 🤗 Accelerate

Below is my implementation, This runs as desired. It would be great if you could review if it is correct design wise. I hope I can publish this as a tutorial or a blog to help others.

import os
import pandas as pd
from accelerate import Accelerator

from datasets import load_dataset

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def transform_user_question_to_chatml_format(user_question):
    user_part = [
        {"role": "user", "content": user_question},
    return user_part

def main():
    test_model_path = "asaha-cdcp/codellama-instruct-oasst-cdcp-tune-pack-false"
    reference_model_path = "codellama/CodeLlama-7b-Instruct-hf"
    eval_question_file = (
    eval_response_file = (
    question_field = "question"

    accelerator = Accelerator()

    eval_qs_ds = load_dataset("csv", data_files=eval_question_file)

    test_tokenizer = AutoTokenizer.from_pretrained(test_model_path)
    test_model = AutoModelForCausalLM.from_pretrained(
        test_model_path, torch_dtype = torch.float16, device_map='auto'

    reference_tokenizer = AutoTokenizer.from_pretrained(reference_model_path)
    reference_model = AutoModelForCausalLM.from_pretrained(
        reference_model_path, torch_dtype = torch.float16, device_map='auto'

    #no need to prepare since it is inference only
    #test_model = accelerator.prepare(test_model)
    #reference_model = accelerator.prepare(reference_model)

    with accelerator.main_process_first():

        def preprocess(samples):
            batch = []
            for user_question in samples[question_field]:
            return {"text": batch}

        eval_qs_ds = eval_qs_ds.map(
            desc=f"apply chat template to dataset",

    with accelerator.split_between_processes(eval_qs_ds["train"]["text"]) as processed_prompts:
        print(f"Processed prompts in Process {accelerator.process_index}:")
    # put models into eval mode

    with accelerator.split_between_processes(eval_qs_ds["train"]["text"]) as processed_prompts:
        if not test_tokenizer.pad_token:
            test_tokenizer.pad_token = reference_tokenizer.eos_token
        input_ids = test_tokenizer(processed_prompts, padding=True, truncation=True, return_tensors="pt").to(accelerator.device)

        # Generate text from test model
        with torch.no_grad():  # Disable gradient calculation for efficiency
            test_outputs = test_model.generate(

        # Decode the output text
        test_model_generated_response = test_tokenizer.batch_decode(
            test_outputs.cpu(), skip_special_tokens=False


    with accelerator.split_between_processes(eval_qs_ds["train"]["text"]) as processed_prompts:
        if not reference_tokenizer.pad_token:
            reference_tokenizer.pad_token = reference_tokenizer.eos_token
        input_ids = reference_tokenizer(processed_prompts, padding=True, truncation=True, max_length=512, return_tensors="pt").to(accelerator.device)

        # Generate text from reference model
        with torch.no_grad():  # Disable gradient calculation for efficiency
            reference_outputs = reference_model.generate(

            # should I batch decode here ? inside no grad ? 

         # or should I batch decode here ? outside the no grad ?
        reference_model_generated_response = reference_tokenizer.batch_decode(
            reference_outputs.cpu(), skip_special_tokens=False

        # should I create a partial file with process_index e.g.output_{accelerator.process_index}.csv ?
        # or can I leverage accelerator.gather(reference_model_generated_response) here ?
    # and then merge here
    with accelerator.is_main_process():
        # read individual files output_{i}.csv and merge

if __name__ == "__main__":
    hub_token = "<HF_TOKEN>"

    os.environ["HUGGING_FACE_HUB_TOKEN"] = hub_token


I have some doubts on how I can collate the outputs from the two models as two new columns and add it to the incoming dataset.

  1. Should we use accelerator.gather(...) to collect the reference_outputs and test_outputs? Does it apply in this case ? If yes, should it be done inside the torch.no_grad() block ? or outside ? In this example transformers/examples/pytorch/translation/run_translation_no_trainer.py at 976189a6df796a2ff442dd81b022626c840d8c27 · huggingface/transformers · GitHub it is done inside the no_grad block while in transformers/examples/pytorch/language-modeling/run_clm_no_trainer.py at 5c7e11e01012112498ece9ee4ab3894c2a702195 · huggingface/transformers · GitHub it is done outside the no_grad block. Which one is correct ?

  2. Or Should we write an intermediate temporary file per process, after accelerator.wait_for_everyone(), on the main process merge all the partial files into one file. I have added comments on my code to make it clear.

  3. Here two models are running sequentially and I have to create the two tokenizers. They tokenize the same prompt twice. I cannot use the tokenized output from test_tokenizer and use in the reference_model.generate(...) method of test model and reference model both, I get a cuda error. Is there a way to make these two model generation run in parallel too ?

  4. What happens if there are thousands of such prompts to evaluate ? Is it possible to splitting and streaming instead of loading all the data at once and delegating it to each process ? Could you suggest a scalable strategy ?

I deliberately disabled accelerator.prepare() for the models because we are just doing inference and there is nothing to sync across devices. Is my understanding right ?

May I request @smangrul @sgugger for review when feasible.

did something very similar, here’s my script accelerate_tricks/inference_batched.py at main · geronimi73/accelerate_tricks · GitHub

Thank You so much @g-ronimo Very nice. I read through them and I would be definitely implementing your tricks. This is indeed very helpful.

1 Like