I want to edit the Llama 3.1 model code locally and then run benchmarks against my modified version.
I have seen the lm-evaluation-harness library. It looks easy to run a benchmark on a model and dataset hosted on Hugging Face, but I'm not sure whether it's flexible enough to work with model code I've edited locally.
Could someone show me how?
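To make the question concrete, here's a rough sketch of what I'm hoping is possible, using the harness's Python API to wrap a model I've loaded myself. The path is a placeholder, and I haven't verified this end to end:

```python
# Rough sketch of what I'm hoping is possible -- untested, and the
# path is a placeholder for wherever my edited Llama 3.1 lives.
from transformers import AutoModelForCausalLM, AutoTokenizer

import lm_eval
from lm_eval.models.huggingface import HFLM

# Load the model through my locally modified code / checkpoint
model = AutoModelForCausalLM.from_pretrained("/path/to/my-edited-llama-3.1")
tokenizer = AutoTokenizer.from_pretrained("/path/to/my-edited-llama-3.1")

# Wrap it so the harness can drive it like any Hugging Face model
lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=8)

# Run a benchmark task against the wrapped model
results = lm_eval.simple_evaluate(model=lm, tasks=["hellaswag"])
print(results["results"])
```

If that's not the right approach, a CLI invocation would also work for me, e.g. something like `lm_eval --model hf --model_args pretrained=/path/to/my-edited-llama-3.1 --tasks hellaswag`, but I don't know whether that would pick up locally edited modeling code or only local weights.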