How to have no preset values sent into .compute() in Huggingface evaluate metrics?

We have a use case (llm_harness_mistral_arc.py · alvations/llm_harness_mistral_arc at main)

where the default feature input types for evaluate.Metric are empty, and our llm_harness_mistral_arc/llm_harness_mistral_arc.py looks something like this:

import evaluate
import datasets
import lm_eval

# Placeholder module-level docstrings that the decorator below references.
_DESCRIPTION = ""
_KWARGS_DESCRIPTION = ""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class llm_harness_mistral_arc(evaluate.Metric):
    def _info(self):
        # TODO: Specifies the evaluate.EvaluationModuleInfo object
        return evaluate.MetricInfo(
            # This is the description that will appear on the modules page.
            module_type="metric",
            description="",
            citation="",
            inputs_description="",
            # This defines the format of each prediction and reference
            features={},
        )

    def _compute(self, pretrained=None, tasks=[]):
        outputs = lm_eval.simple_evaluate(
            model="hf",
            model_args={"pretrained": pretrained},
            tasks=tasks,
            num_fewshot=0,
        )
        results = {}
        for task in outputs['results']:
            results[task] = {'acc': outputs['results'][task]['acc,none'],
                             'acc_norm': outputs['results'][task]['acc_norm,none']}
        return results

And the expected user behavior is something like, [in]:

import evaluate

module = evaluate.load("alvations/llm_harness_mistral_arc")
module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2", tasks=["arc_easy"])

And the expected output, as per our tests.py (tests.py · alvations/llm_harness_mistral_arc at main), [out]:

{'arc_easy': {'acc': 0.8131313131313131, 'acc_norm': 0.7680976430976431}}

But evaluate.Metric.compute() apparently expects at least one batch to have been added first, and module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2", tasks=["arc_easy"]) throws an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-bd94e5882ca5> in <cell line: 1>()
----> 1 module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2",
      2                tasks=["arc_easy"])

2 frames
/usr/local/lib/python3.10/dist-packages/evaluate/module.py in _get_all_cache_files(self)
    309         if self.num_process == 1:
    310             if self.cache_file_name is None:
--> 311                 raise ValueError(
    312                     "Evaluation module cache file doesn't exist. Please make sure that you call `add` or `add_batch` "
    313                     "at least once before calling `compute`."

ValueError: Evaluation module cache file doesn't exist. Please make sure that you call `add` or `add_batch` at least once before calling `compute`.
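
For context, a built-in metric only gets past this check because compute() is called with list-valued inputs that match its declared features, and (as the error message hints) those get added as a batch before _compute runs. A minimal contrast with the stock accuracy metric:

import evaluate

# A built-in metric: compute() receives list-valued predictions/references
# matching the declared features, which is what creates the cache file the
# error above is complaining about.
accuracy = evaluate.load("accuracy")
accuracy.compute(predictions=[0, 1, 1], references=[0, 1, 0])
# {'accuracy': 0.6666666666666666}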

Q: Is it possible for .compute() to expect no features?

I've also tried this, but somehow evaluate.Metric.compute is still looking for some sort of predictions variable.

@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class llm_harness_mistral_arc(evaluate.Metric):
    def _info(self):
        # TODO: Specifies the evaluate.EvaluationModuleInfo object
        return evaluate.MetricInfo(
            # This is the description that will appear on the modules page.
            module_type="metric",
            description="",
            citation="",
            inputs_description="",
            # This defines the format of each prediction and reference
            features=[
                datasets.Features(
                    {
                        "pretrained": datasets.Value("string", id="sequence"),
                        "tasks": datasets.Sequence(datasets.Value("string", id="sequence"), id="tasks"),
                    }
                )]
        )

    def _compute(self, pretrained, tasks):
        outputs = lm_eval.simple_evaluate(
            model="hf",
            model_args={"pretrained": pretrained},
            tasks=tasks,
            num_fewshot=0,
        )
        results = {}
        for task in outputs['results']:
            results[task] = {'acc': outputs['results'][task]['acc,none'],
                             'acc_norm': outputs['results'][task]['acc_norm,none']}
        return results

then:

import evaluate

module = evaluate.load("alvations/llm_harness_mistral_arc")
module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2", tasks=["arc_easy"])

[out]:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-36-bd94e5882ca5> in <cell line: 1>()
----> 1 module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2",
      2                tasks=["arc_easy"])

3 frames
/usr/local/lib/python3.10/dist-packages/evaluate/module.py in _infer_feature_from_example(self, example)
    606             f"Predictions and/or references don't match the expected format.\n"
    607             f"Expected format:\n{feature_strings},\n"
--> 608             f"Input predictions: {summarize_if_long_list(example['predictions'])},\n"
    609             f"Input references: {summarize_if_long_list(example['references'])}"
    610         )

KeyError: 'predictions'
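
As far as I can tell, the KeyError itself is incidental: evaluate fails to line the inputs up against the declared Features, and the error-formatting path then assumes predictions/references keys that this module never declared. The Features above describe one row per example, so each input would have to be a list (and tasks a list of lists). A quick illustration of that shape using datasets directly; encode_batch here is a datasets.Features method, unrelated to the metric itself:

import datasets

features = datasets.Features(
    {
        "pretrained": datasets.Value("string"),
        "tasks": datasets.Sequence(datasets.Value("string")),
    }
)

# One row per example: every column is a list, and "tasks" is a list of lists.
features.encode_batch(
    {
        "pretrained": ["mistralai/Mistral-7B-Instruct-v0.2"],
        "tasks": [["arc_easy"]],
    }
)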

Also asked on:

Hi! .compute does expect features AFAIK (at least one; maybe you can define the tasks one).

Also note that the values passed to .compute() should be lists (usually a list of references or predictions). I think that's why your second attempt failed: you passed pretrained as a single value, which is not the expected format.
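
Following that suggestion, here is an untested sketch of the module with every feature value passed as a list (one entry per example) and _compute unpacking the accumulated columns. It assumes compute() routes the list-valued kwargs that match the declared Features through add_batch and then hands the columns to _compute; the class and feature names simply mirror the attempts above:

import evaluate
import datasets
import lm_eval

_DESCRIPTION = ""
_KWARGS_DESCRIPTION = ""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class llm_harness_mistral_arc(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            module_type="metric",
            description="",
            citation="",
            inputs_description="",
            # One row per example: a model name plus its list of task names.
            features=datasets.Features(
                {
                    "pretrained": datasets.Value("string"),
                    "tasks": datasets.Sequence(datasets.Value("string")),
                }
            ),
        )

    def _compute(self, pretrained, tasks):
        # pretrained and tasks arrive as the accumulated columns, i.e. lists
        # with one entry per example.
        results = {}
        for model_name, task_list in zip(pretrained, tasks):
            outputs = lm_eval.simple_evaluate(
                model="hf",
                model_args={"pretrained": model_name},
                tasks=task_list,
                num_fewshot=0,
            )
            for task in outputs['results']:
                results[task] = {
                    'acc': outputs['results'][task]['acc,none'],
                    'acc_norm': outputs['results'][task]['acc_norm,none'],
                }
        return results

and the call would then wrap every value in a list:

module = evaluate.load("alvations/llm_harness_mistral_arc")
module.compute(
    pretrained=["mistralai/Mistral-7B-Instruct-v0.2"],  # one entry per example
    tasks=[["arc_easy"]],                               # list of lists
)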