Creating a SyntaxGym dataset -- structure and evaluation questions

Hi all,

I’m a co-maintainer of a dataset/service called SyntaxGym, which is a collection of targeted syntactic evaluations for autoregressive language models. We’re interested in providing SyntaxGym evaluations via the Huggingface Datasets API. Our dataset is structured a bit differently from your typical evaluation, though, so I wanted to know if/how it could be integrated into the existing Datasets API. I’ve included a schematic of typical evaluation items below for demonstration, followed by some apparent points of difference:

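Roughly, a single item looks something like this (the field names, the agreement example, and the formula syntax are simplified for illustration and don’t exactly match the real SyntaxGym schema):

```python
# Illustrative only: one item from a hypothetical subject-verb agreement suite.
# Field names and formula syntax are approximations, not the exact SyntaxGym schema.
item = {
    "item_number": 1,
    "conditions": [
        {
            "condition_name": "match",
            "regions": [
                {"region_number": 1, "content": "The author"},
                {"region_number": 2, "content": "near the senators"},
                {"region_number": 3, "content": "is"},        # critical region
                {"region_number": 4, "content": "famous."},
            ],
        },
        {
            "condition_name": "mismatch",
            "regions": [
                {"region_number": 1, "content": "The author"},
                {"region_number": 2, "content": "near the senators"},
                {"region_number": 3, "content": "are"},       # critical region
                {"region_number": 4, "content": "famous."},
            ],
        },
    ],
    # One prediction: surprisal at region 3 should be lower in "match" than in "mismatch".
    "predictions": ["(3;%match%) < (3;%mismatch%)"],
}
```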
  1. This is an evaluation-only dataset; there is no train/test split.
  2. The dataset has several levels of structure:
     - Circuits are categories of test suites testing different syntactic competencies.
     - Test suites each test an individual syntactic phenomenon by evaluating one or more predictions over a series of items.
     - Items each contain two or more minimal-pair sentences.
     - Predictions evaluate differences in surprisal between sentence variants within an item, measured at some critical region.
  3. Metrics are accumulated differently at different levels. A model gets an item “correct” if it behaves correctly on all the relevant predictions; these boolean results are then averaged across items to derive a test suite “accuracy.”

My concrete questions:

  1. What is the best way to fit this structure into the Huggingface Datasets API? One simple solution is to designate a single train split, with one subset per test suite (roughly following the BLiMP structure). We’d then need to represent each test suite in a “long” format, with one row per region+item (see the rough sketch after this list).
  2. Our metric is relatively complex: it requires comparing a model’s predictions across multiple sentence inputs. (With the dataset structure above, it would involve reconstructing sentence inputs from the region rows, passing them to a model, and then decomposing the output.) Are there examples of similarly complex metric implementations with this API that I could use as a reference?
  3. These metrics are computed dynamically by parsing formula strings; we use pyparsing for this purpose (toy illustration further below). Is it possible or problematic for a dataset to carry extra Python dependencies?
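To make questions 1 and 2 concrete, here is a rough sketch of what the “long” format and the corresponding metric flow might look like. Everything here is illustrative: the column names, the made-up surprisal numbers, and the hard-coded comparison are stand-ins for whatever a real implementation would do.

```python
from collections import defaultdict

# Hypothetical "long" rows: one row per item + condition + region.
# Column names and sentences are illustrative, not a proposed final schema.
rows = [
    {"item": 1, "condition": "match",    "region": 1, "content": "The author"},
    {"item": 1, "condition": "match",    "region": 2, "content": "is famous."},
    {"item": 1, "condition": "mismatch", "region": 1, "content": "The author"},
    {"item": 1, "condition": "mismatch", "region": 2, "content": "are famous."},
]

# Step 1: reassemble one sentence per (item, condition) from its region rows,
# so the model can be run on full sentences rather than isolated regions.
parts = defaultdict(list)
for row in sorted(rows, key=lambda r: (r["item"], r["condition"], r["region"])):
    parts[(row["item"], row["condition"])].append(row["content"])
sentences = {key: " ".join(chunks) for key, chunks in parts.items()}

# Step 2 (not shown): score each sentence with the model and map token-level
# surprisals back onto regions. Pretend that produced the made-up numbers below:
#   region_surprisals[(item, condition)][region_number] = summed surprisal
region_surprisals = {
    (1, "match"):    {1: 10.2, 2: 3.1},
    (1, "mismatch"): {1: 10.2, 2: 6.8},
}

# Step 3: evaluate the prediction at the critical region, then aggregate per item.
# A real metric would parse the prediction formula instead of hard-coding "<".
def item_correct(item_id, critical_region=2):
    return (region_surprisals[(item_id, "match")][critical_region]
            < region_surprisals[(item_id, "mismatch")][critical_region])

items = sorted({row["item"] for row in rows})
suite_accuracy = sum(item_correct(i) for i in items) / len(items)
```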
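And for question 3, a toy illustration of the kind of formula parsing involved; this approximates, but is not, SyntaxGym’s actual prediction grammar, and is only meant to show the sort of pyparsing dependency in question.

```python
import pyparsing as pp

# Toy grammar for formulas of the form "(<region>;%<condition>%) < (<region>;%<condition>%)".
integer = pp.pyparsing_common.integer          # parses digits and converts to int
condition = pp.QuotedString("%")               # %match% -> "match"
term = pp.Group(
    pp.Suppress("(") + integer("region") + pp.Suppress(";") + condition("condition") + pp.Suppress(")")
)
formula = term("lhs") + pp.one_of("< >")("op") + term("rhs")

parsed = formula.parse_string("(3;%match%) < (3;%mismatch%)")
print(parsed.lhs.region, parsed.op, parsed.rhs.condition)   # -> 3 < mismatch
```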

I hope these questions are somewhat clear. I could always just dive in and hack something together, but I was curious to hear whether more experienced HF users have ideas about best-practice directions here.

If it’s helpful, there is further documentation on the test suite structure here.

I’ve figured out most of the process myself by this point; see the main definition file here.

Answers to the above, for posterity’s sake at least:

  1. There’s no need for the “long” format: nested data structures fit directly into single “examples” in the Huggingface representation. I was able to create a feature specification that matches the existing structure of SyntaxGym item representations (rough sketch after this list).
  2. This turns out to be easy when the model outputs for a single input example suffice to compute the metric, which is the case with the nested representation above, since each example carries all of an item’s conditions.
  3. No problem. HF datasets automatically detects import errors triggered within dataset code and warns the user.
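To show what I mean in answer 1, here is a heavily simplified sketch of a nested feature specification; the real one in the definition file linked above is richer, and the field names here are placeholders rather than the exact SyntaxGym schema.

```python
import datasets

# Heavily simplified relative to the real definition file; field names are placeholders.
features = datasets.Features({
    "suite_name": datasets.Value("string"),
    "item_number": datasets.Value("int32"),
    "predictions": datasets.Sequence(datasets.Value("string")),
    "conditions": datasets.Sequence({
        "condition_name": datasets.Value("string"),
        "regions": datasets.Sequence({
            "region_number": datasets.Value("int32"),
            "content": datasets.Value("string"),
        }),
    }),
})
```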

Looks like a nice project :slight_smile: I’m glad you managed to find the answers to your questions.

Feel free to ping me if you have more questions :wink: