Evaluating LLMs for specific programming languages

Hello everyone! I’m working on a project to fine-tune stable-code for the Ruby programming language.

As a first step, I’m also looking for pointers on evaluating the available models. I came across mxeval/multi-humaneval · Datasets at Hugging Face, but the Ruby subset seems incomplete, as the canonical_solution field is empty.
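
For context, this is roughly how I’ve been inspecting the Ruby split, as a minimal sketch assuming the `datasets` library and that the Ruby subset is exposed under a config named "ruby" with a "test" split (the exact config/split names may differ; please check the dataset card):

```python
# Sketch: load the Ruby split of mxeval/multi-humaneval and count entries
# whose canonical_solution field is empty.
# Assumptions: config name "ruby" and split "test" (verify on the dataset card).
from datasets import load_dataset

ds = load_dataset("mxeval/multi-humaneval", "ruby", split="test")

# Treat None or whitespace-only values as missing canonical solutions.
missing = [row["task_id"] for row in ds if not (row["canonical_solution"] or "").strip()]
print(f"{len(missing)} of {len(ds)} entries have an empty canonical_solution")
```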

I’m currently working out the canonical solutions for those 161 entries, but I’d like to ask the community if there are any other evaluation datasets that I can use. Any tips or suggestions are welcome. Thanks in advance.