It appears that the code in this example raises a NotImplementedError stating that GLEU is currently under construction. The last update was about a year ago.
I would like to know what is required to make this metric work from within datasets. Perhaps I could make a pull request with the necessary changes.
Hi! I think someone started working on it, but they stopped pretty early in the implementation.
The current code is here, and it appears that the _compute method is incomplete/not working. This method is supposed to return a dictionary of metrics (for example {"gleu_mean": mean, "gleu_std": std, "gleu_ci": ci}) computed from the lists of references and predictions.
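Just to illustrate, here is a rough sketch of what _compute could look like (the sentence_gleu_score helper below is a placeholder, the actual scoring logic still has to be written):

import numpy as np

def _compute(self, predictions, references):
    # sentence_gleu_score is a hypothetical per-sentence scorer, not existing code
    scores = [
        sentence_gleu_score(refs, pred)
        for refs, pred in zip(references, predictions)
    ]
    return {"gleu_mean": float(np.mean(scores)), "gleu_std": float(np.std(scores))}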
If you’re interested in contributing, feel free to try to make this method work and open a PR.
Locally you can test your metric with
from datasets import load_metric
gleu = load_metric("path/to/gleu.py")
gleu.compute(references=references, predictions=predictions)
Is there a full list of metrics that are supposed to be there? For example, _compute for BLEU returns several other values, and these are not actually computed within the method itself (the computation is delegated to the original script).
GLEU’s _KWARGS_DESCRIPTION appears to be copy-pasted from BLEU, so I guess that’s not it.
We’re free to choose the list of metrics that are returned, and update the KWARGS_DESCRIPTION accordingly. I think returning the mean GLEU and the std is already pretty nice.
But feel free to take a look at papers where GLEU is used to see if there are other values that can be relevant.
I was actually looking for the second one for my current task. There’s an NLTK implementation for that one. It seems it should be easy enough to add it to datasets; after all, nltk is already used by the rouge metric, for example. The question, however, is how to distinguish these two GLEUs.
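For reference, here is roughly how the NLTK implementation (nltk.translate.gleu_score, which implements the MT/Google GLEU) is called on tokenized text; the tokenization below is just whitespace splitting for illustration:

from nltk.translate.gleu_score import corpus_gleu, sentence_gleu

# one or more tokenized references per prediction
references = [["the cat is on the mat".split(), "there is a cat on the mat".split()]]
predictions = ["the cat sat on the mat".split()]

corpus_score = corpus_gleu(references, predictions)            # corpus-level GLEU
sentence_score = sentence_gleu(references[0], predictions[0])  # single sentence pair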
That said, I’m also interested in fixing the GEC GLEU (although after the MT one).
On a related note, is there an aliasing mechanism for metrics or something like that? So that if you tried to load it with load_metric("gleu"), you’d be prompted to choose one of the two (via an exception, for example). At a glance it’s not obvious at all that there are actually two metrics with a similar purpose that share the same name.
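Something like this is what I have in mind (purely hypothetical, neither the names nor this behavior exist in datasets today):

from datasets import load_metric

load_metric("gleu")
# could hypothetically raise something like:
# ValueError: "gleu" is ambiguous, did you mean "google_gleu" (machine translation)
# or "gec_gleu" (grammatical error correction)? Please load one of them explicitly.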