GLEU under construction?

Hi,

It appears that the code in this example raises a NotImplementedError stating that GLEU is currently under construction. The last update was about a year ago.

I would like to know what is required for this metric to work within datasets. Perhaps I could make a pull request with the necessary changes.

Hi ! I think someone started working on it but they stopped pretty early in the implementation.

The current code is here and it appears that the _compute method is incomplete/not working. This method is supposed to return a dictionary of metrics (for example {"gleu_mean": mean, "gleu_std": std, "gleu_ci": ci}) from the list of references and predictions.

If you’re interested in contributing, feel free to try to make this method work and create a PR :slight_smile:

Locally you can test your metric with

from datasets import load_metric

gleu = load_metric("path/to/gleu.py")
gleu.compute(references=references, predictions=predictions)

Is there a full list of metrics that are supposed to be there? For example, _compute for BLEU returns other metrics and these metrics are not actually computed within the method itself (it’s delegated to the original script).

GLEU’s _KWARGS_DESCRIPTION appears to be copy-pasted from BLEU, so that’s not it I guess.

We’re free to choose the list of metrics that are returned, and update the KWARGS_DESCRIPTION accordingly. I think returning the mean GLEU and the std is already pretty nice :slight_smile:
But feel free to take a look at papers where GLEU is used to see if there are other values that can be relevant.

Right, so I just realized there’s some confusion with the names of the metrics. There’s GLEU for grammatical error correction and there’s GLEU for machine translation (aka Google BLEU). They’re completely unrelated.

I was actually looking for the second one for my current task. There’s an NLTK implementation for that one, and it seems it should be easy enough to add it to datasets. After all, nltk is already used by the rouge metric, for example. The question, however, is how to distinguish these two GLEUs?

That said, I’m also interested in fixing the GEC GLEU as well (although after the MT one).

The NLTK implementation is here.
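For context, the Google-BLEU variant that NLTK implements is, roughly, the minimum of n-gram precision and n-gram recall over orders 1–4. Here is a small pure-Python sketch of that idea (an illustration of the algorithm, not the actual NLTK code — names like sentence_gleu here are just for this example):

```python
from collections import Counter

def all_ngrams(tokens, max_n=4):
    # Counts of all n-grams of order 1..max_n.
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def sentence_gleu(reference, hypothesis, max_n=4):
    ref_counts = all_ngrams(reference, max_n)
    hyp_counts = all_ngrams(hypothesis, max_n)
    # Clipped n-gram matches between hypothesis and reference.
    matches = sum(min(cnt, ref_counts[g]) for g, cnt in hyp_counts.items())
    precision = matches / max(sum(hyp_counts.values()), 1)
    recall = matches / max(sum(ref_counts.values()), 1)
    # Google BLEU = min of precision and recall.
    return min(precision, recall)
```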

Oh indeed, we have a naming collision here.
Maybe name the first one gec_gleu and the other google_bleu?

Sounds good!

On a related note, is there an aliasing mechanism for metrics, or something like that? So that if you tried to load one with load_metric("gleu"), you’d be prompted to choose one of the two (by means of an exception, for example). At a glance it’s not obvious at all that there are actually two metrics with a similar purpose that share the same name.