Hi all,
I’m a co-maintainer of a dataset/service called SyntaxGym, which is a collection of targeted syntactic evaluations for autoregressive language models. We’re interested in providing SyntaxGym evaluations via the Huggingface Datasets API. Our dataset is structured a bit differently from a typical evaluation dataset, though, so I wanted to know if/how it could be integrated into the existing Datasets API. I’ve included a schematic of typical evaluation items below for demonstration, followed by some apparent points of difference:
- This is an evaluation-only dataset; there is no train/test split.
- The dataset has several levels of structure:
- Circuits are categories of test suites testing different syntactic competencies.
- Test suites each test an individual syntactic phenomenon by evaluating one or more predictions over a series of items.
- Each item contains two or more minimal-pair sentences.
- Predictions evaluate differences in surprisal between sentence variants within an item, measured at some critical region.
- Metrics are accumulated differently at different levels. A model gets an item “correct” if it behaves correctly on all of the relevant predictions; these boolean results are then averaged across items to derive a test-suite “accuracy” (see the sketch just below).
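For concreteness, the accumulation logic is roughly the following (the data structures here are placeholders for illustration, not anything from our codebase):

```python
from typing import Dict, List

def suite_accuracy(item_results: Dict[int, List[bool]]) -> float:
    """Sketch of the metric accumulation.

    `item_results` maps an item number to the boolean outcome of each
    prediction evaluated on that item.
    """
    # An item counts as correct only if every relevant prediction holds.
    item_correct = [all(predictions) for predictions in item_results.values()]
    # Suite accuracy is the mean of the per-item booleans.
    return sum(item_correct) / len(item_correct)
```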
My concrete questions:
- What is the best way to fit this structure into the Huggingface Datasets format? One simple solution is to designate a single `train` split, with one subset for each test suite (roughly following the BLiMP structure). We’d then need to represent each test suite in a “long” format, with one row per region+item (see the schema sketch after this list).
- Our metric is relatively complex: it requires comparing a model’s predictions across multiple sentence inputs. (With the dataset structure above, it would actually involve constructing sentence inputs from the region rows, passing them to a model, and then decomposing the output.) Are there examples of similarly complex metric implementations with this API that I could use as a reference?
- These metrics are computed dynamically by parsing formula strings; we use `pyparsing` for this purpose (see the toy sketch after this list). Is it possible / problematic to have datasets carrying extra Python dependencies?
I hope these questions are somewhat clear. I could always just dive in and hack something together, but I was curious to hear whether more experienced HF users have ideas about best-practice directions here.