In my research project I’m loading UD using Datasets, but I’ve encountered a few issues along the way that I want to resolve:
- The version of UD is hard-coded (it’s 2.7 at the moment, while the most recent version is 2.10, and 2.11 is going to be released soon, I think). Simply hard-coding a newer version would be a bad solution, because sometimes you need to evaluate on a very specific version, neither the old one nor the most recent one.
- It would be helpful to be able to load an arbitrary conllu file. Maybe you want to see what effect some minor change in annotation has?
- Since the conllu library is already used, it would be nice to be able to convert the dataset back to this library’s representation. This would make serializing the data back to conllu format a lot easier, for instance. Although this might be out of scope for Datasets.
So, as I said, I want to contribute to solving these issues, but I need some direction.
Ideally, I would like to select a specific version like this: load_dataset("universal_dependencies", "fr_partut", version="r2.8"), and load an arbitrary file like this: load_dataset("universal_dependencies", path_or_url="my_file.conllu"). I’m not sure this is actually possible, however. Alternatively, the subset name could be repurposed: load_dataset("universal_dependencies", "fr_partut:r2.8"), but that looks a lot jankier, to be honest.
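
For the subset-name variant, a minimal sketch of what the parsing could look like inside the loading script is below. Everything here is an assumption on my part: the helper names, the one-entry treebank mapping, and the URL layout (I am assuming the files live in the UniversalDependencies GitHub repositories under release tags such as r2.8). None of this is existing Datasets API.

```python
# Assumed base URL and default tag; not taken from the Datasets source.
UD_BASE_URL = "https://raw.githubusercontent.com/UniversalDependencies"
DEFAULT_VERSION = "r2.7"

# Hypothetical mapping from subset name to GitHub repository name.
TREEBANK_REPOS = {"fr_partut": "UD_French-ParTUT"}


def split_config_name(name, default_version=DEFAULT_VERSION):
    """Split 'fr_partut:r2.8' into ('fr_partut', 'r2.8').

    A name without ':' falls back to the default version.
    """
    treebank, sep, version = name.partition(":")
    if sep and not version:
        raise ValueError(f"missing version after ':' in {name!r}")
    return treebank, version or default_version


def data_url(name, split="train"):
    """Build the (assumed) download URL for one split of a treebank."""
    treebank, version = split_config_name(name)
    repo = TREEBANK_REPOS[treebank]
    return f"{UD_BASE_URL}/{repo}/{version}/{treebank}-ud-{split}.conllu"


print(split_config_name("fr_partut:r2.8"))
print(data_url("fr_partut:r2.8", split="test"))
```

A separate version keyword would avoid the string parsing entirely, which is part of why I prefer the first form.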