Multiple issues with JSON output of punctuation restoration model

moriquendi · April 26, 2022, 7:40am

I get weird JSON output when trying this punctuation-restoration model.

The author said that the model assigns a “punctuation mark” to each word passed into the model.
However, there seems to be an issue with pre- or post-processing that causes punctuation to be inserted in the middle of words.

Try the input I pasted below.

1. The single word “mathematician” is broken into 3 words, and a full stop is inserted in the middle of it. How’s that possible?
2. I think the start / end params in the output array are supposed to be the positions of each word in the final output string. But they don’t account for the spaces between words, so one word’s end = the next one’s start. And because of issue #1, we can’t rely on the punctuation mark always being inserted at the end of a word.
3. In the JSON output there’s a “” (empty string) for some labels that were supposed to be “-”.
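To make issues 1 and 2 concrete, here is a minimal sketch of how the pieces could be regrouped into whole words on the client side. It assumes that start / end are character offsets into the input text and that a piece begins a new word only when whitespace sits at or before its start. The function name `merge_pieces`, the keys "start", "end" and "label", and the sample offsets are all hypothetical and made up for illustration; they are not the model's actual output.

```python
def merge_pieces(text, pieces):
    """Regroup sub-word pieces into whole words using character offsets.

    Assumption: each piece is a dict with hypothetical keys "start", "end"
    and "label", where start/end index into the original input text.
    """
    words = []
    for piece in pieces:
        start, end = piece["start"], piece["end"]
        prev_end = words[-1]["end"] if words else None
        # A piece opens a new word if it is the first piece, if whitespace
        # lies between it and the previous piece, or if its own span
        # begins with whitespace.
        new_word = (
            prev_end is None
            or any(ch.isspace() for ch in text[prev_end:start])
            or text[start:start + 1].isspace()
        )
        if new_word:
            words.append({"start": start, "end": end, "labels": [piece["label"]]})
        else:
            # Same word: extend the span and remember this piece's label too.
            words[-1]["end"] = end
            words[-1]["labels"].append(piece["label"])
    for word in words:
        word["word"] = text[word["start"]:word["end"]].strip()
    return words


# Made-up sample: "mathematician" split into three pieces, with a full stop
# label on the middle piece, and end of one piece == start of the next.
text = "English mathematician and"
pieces = [
    {"start": 0, "end": 7, "label": "0"},
    {"start": 7, "end": 13, "label": "0"},
    {"start": 13, "end": 16, "label": "."},
    {"start": 16, "end": 21, "label": "0"},
    {"start": 21, "end": 25, "label": "0"},
]
words = merge_pieces(text, pieces)
for word in words:
    print(word["word"], word["labels"])
```

With this sample, “mathematician” comes back as one word carrying three labels. Which of those labels should win (the last piece's, for instance) is a separate decision that this sketch deliberately leaves open.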

Input:

I’ve never known Warhol through my life who have been I don’t think I’m obsessed is an understatement of zest with puzzles of different types and wanted to get traditional one of them was one of the most curious and intelligent people I’ve ever met Houghton Houghton Conway and I would I’m looking at as Wikipedia unfortunately passed away some time ago but English mathematician and then the theory of finite

I’m new to this, so please bear with me.
Who’s at fault here?
Does the model’s author write the script that generates the output, or is that managed by Hugging Face? I can’t find the script responsible for processing the input/output anywhere.

Model link: oliverguhr/fullstop-punctuation-multilang-large · Hugging Face
