Universal Dependencies limitations

In my research project I’m loading UD using 🤗 Datasets, but I’ve run into a few issues along the way that I’d like to resolve:

  1. The version of UD is hard-coded (it’s 2.7 at the moment, while the most recent version is 2.10, and 2.11 is going to be released soon, I think). Simply hard-coding a newer version would be a poor fix, because sometimes you need to evaluate against one specific version, which may be neither the currently hard-coded one nor the most recent one.

  2. It would be helpful to be able to load an arbitrary CoNLL-U file, for example to see what effect a minor change in annotation has.

  3. Since the conllu library is already used, it would be nice to be able to convert the dataset back to that library’s representation. This would make serializing the data back to CoNLL-U format a lot easier, for instance. This might be out of scope for 🤗 Datasets, though.
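To make point 3 concrete, here is a rough sketch of what serializing a sentence back to CoNLL-U could look like without any library support. It is only an illustration under assumptions: the row layout (parallel per-token lists named `tokens`, `lemmas`, `upos`, and so on, with tags already decoded to strings) and the helper name `sentence_to_conllu` are hypothetical, and multiword tokens and empty nodes are ignored, since tokens are simply numbered 1..n.

```python
# Hypothetical helper: the row layout below is an assumption, not a confirmed schema.
COLUMNS = ["tokens", "lemmas", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def sentence_to_conllu(row):
    """Serialize one sentence (a dict of parallel per-token lists) to CoNLL-U text."""
    lines = []
    if row.get("idx"):
        lines.append(f"# sent_id = {row['idx']}")
    if row.get("text"):
        lines.append(f"# text = {row['text']}")
    for i in range(len(row["tokens"])):
        fields = [str(i + 1)]  # ID column; multiword tokens and empty nodes are not handled
        for col in COLUMNS:
            values = row.get(col)
            value = values[i] if values else None
            # CoNLL-U marks an unspecified field with an underscore
            fields.append("_" if value in (None, "", "None") else str(value))
        lines.append("\t".join(fields))
    # Sentences are terminated by a blank line
    return "\n".join(lines) + "\n\n"


example_row = {
    "idx": "s1",
    "text": "Il dort.",
    "tokens": ["Il", "dort", "."],
    "lemmas": ["il", "dormir", "."],
    "upos": ["PRON", "VERB", "PUNCT"],
    "xpos": [None, None, None],
    "feats": [None, None, None],
    "head": ["2", "0", "2"],
    "deprel": ["nsubj", "root", "punct"],
    "deps": [None, None, None],
    "misc": [None, None, None],
}

print(sentence_to_conllu(example_row), end="")
```

A conversion to the conllu library’s own objects would probably be more robust than this, since the library already deals with the ID ranges and comment metadata that this sketch skips.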

I’d like to contribute fixes for these issues myself, but I need some direction.

Ideally, I would like to select a specific version like this: load_dataset("universal_dependencies", "fr_partut", version="r2.8"), and load an arbitrary file like this: load_dataset("universal_dependencies", path_or_url="my_file.conllu"). I’m not sure this is actually possible, however. Alternatively, the subset name could be repurposed: load_dataset("universal_dependencies", "fr_partut:r2.8"), but that looks a lot jankier, to be honest.
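For the subset-name variant, the parsing side would be small. The sketch below is purely hypothetical: `split_subset_name` is not part of 🤗 Datasets, and the default of "r2.7" is just my assumption about how the currently hard-coded version would be spelled.

```python
def split_subset_name(name, default_version="r2.7"):
    """Split a name like 'fr_partut:r2.8' into (treebank, version).

    Hypothetical helper; falls back to default_version when no ':' is present.
    """
    treebank, sep, version = name.partition(":")
    if not treebank:
        raise ValueError(f"missing treebank name in {name!r}")
    if sep and not version:
        raise ValueError(f"missing version after ':' in {name!r}")
    return treebank, (version if sep else default_version)


print(split_subset_name("fr_partut:r2.8"))
print(split_subset_name("fr_partut"))
```

The downside is the one already mentioned: the version is hidden inside a string instead of being a proper argument, so every consumer of the subset name has to know about the separator.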