Hi,
I used to successfully run the rbawden/modern_french_normalisation model to convert Old French to modern French, but recently, the output has become incorrect.
Steps to reproduce (code provided on the Hugging Face page):
```python
from transformers import pipeline

normaliser = pipeline(
    model="rbawden/modern_french_normalisation",
    batch_size=32,
    beam_size=5,
    cache_file="./cache.pickle",
    trust_remote_code=True,
)

list_inputs = ["Elle haïſſoit particulierement le Cardinal de Lorraine;", "Adieu, i'iray chez vous tantoſt vous rendre grace."]
list_outputs = normaliser(list_inputs)
print(list_outputs)
```
Expected output (as shown on the Hugging Face page that I was able to obtain previously):
```python
[{'text': 'Elle haïssait particulièrement le Cardinal de Lorraine;',
  'alignment': [([0, 4], [0, 4]), ([4, 5], [4, 5]), ([5, 13], [5, 13]), ([13, 14], [13, 14]), ([14, 30], [14, 30]), ([30, 31], [30, 31]), ([31, 33], [31, 33]), ([33, 34], [33, 34]), ([34, 42], [34, 42]), ([42, 43], [42, 43]), ([43, 45], [43, 45]), ([45, 46], [45, 46]), ([46, 54], [46, 54]), ([54, 55], [54, 55])]},
 {'text': "Adieu, j'irai chez vous tantôt vous rendre grâce.",
  'alignment': [([0, 5], [0, 5]), ([5, 6], [5, 6]), ([6, 7], [6, 7]), ([7, 9], [7, 9]), ([9, 13], [9, 13]), ([13, 14], [13, 14]), ([14, 18], [14, 18]), ([18, 19], [18, 19]), ([19, 23], [19, 23]), ([23, 24], [23, 24]), ([24, 31], [24, 30]), ([31, 32], [30, 31]), ([32, 36], [31, 35]), ([36, 37], [35, 36]), ([37, 43], [36, 42]), ([43, 44], [42, 43]), ([44, 49], [43, 48]), ([49, 50], [48, 49])]}]
```
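For context, this is how I consume the `alignment` field — a minimal sketch assuming each pair is a pair of end-exclusive character spans `([src_start, src_end], [tgt_start, tgt_end])` into the input and output strings; `project_span` is my own helper, not part of the pipeline:

```python
def project_span(alignment, src_start, src_end):
    """Map a character span of the original text onto the normalised text,
    using the pipeline's alignment pairs. Spans are assumed end-exclusive."""
    covered = [tgt for src, tgt in alignment
               if src[0] >= src_start and src[1] <= src_end]
    if not covered:
        return None
    return covered[0][0], covered[-1][1]

normalised = "Adieu, j'irai chez vous tantôt vous rendre grâce."
alignment = [([0, 5], [0, 5]), ([5, 6], [5, 6]), ([6, 7], [6, 7]),
             ([7, 9], [7, 9]), ([9, 13], [9, 13]), ([13, 14], [13, 14]),
             ([14, 18], [14, 18]), ([18, 19], [18, 19]), ([19, 23], [19, 23]),
             ([23, 24], [23, 24]), ([24, 31], [24, 30]), ([31, 32], [30, 31]),
             ([32, 36], [31, 35]), ([36, 37], [35, 36]), ([37, 43], [36, 42]),
             ([43, 44], [42, 43]), ([44, 49], [43, 48]), ([49, 50], [48, 49])]

# "tantoſt" occupies [24, 31] in the original sentence:
start, end = project_span(alignment, 24, 31)
print(normalised[start:end])  # → tantôt
```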
Now I am getting errors in the output (“haïssoit” instead of “haïssait”, “grace” instead of “grâce”).
Additionally, when using the following input:
```python
["Le Loup ne fut pas longtems à arriver à la maiſon de la Mere-grand, il heurte: Toc, toc, qui eſt là?"]
```
the output is really strange, with missing and repeated words:
```python
'Le Loup ne fut pas longtems à arriver Fr la maison ne fut Pas-grand, est heurte ne Toc, long, qui est là?'
```
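To make the corruption easier to see, I ran a quick word-count comparison of the input against the garbled output (plain Python, my own check, not part of the pipeline; tokens are lowercased and stripped of surrounding punctuation before counting):

```python
from collections import Counter
import string

original = ("Le Loup ne fut pas longtems à arriver à la maiſon de la "
            "Mere-grand, il heurte: Toc, toc, qui eſt là?")
garbled = ("Le Loup ne fut pas longtems à arriver Fr la maison ne fut "
           "Pas-grand, est heurte ne Toc, long, qui est là?")

def word_counts(text):
    # Lowercase and strip surrounding punctuation so "Toc," and "toc," match.
    return Counter(w.strip(string.punctuation).lower() for w in text.split())

src, out = word_counts(original), word_counts(garbled)
repeated = {w for w in out if out[w] > src[w] >= 1}  # duplicated by the model
missing = {w for w in src if src[w] > out[w]}        # dropped by the model
print(sorted(repeated))  # → ['fut', 'ne']
print(sorted(missing))
```

So “ne” and “fut” get duplicated while content words such as “maiſon” and “eſt” disappear entirely.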
When loading the pipeline, I also notice these warning messages:
```
- pipeline.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
```
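Since that warning suggests pinning a revision, I also tried pinning the pipeline to a fixed commit as a workaround — sketch below; the hash is a placeholder, not a real revision of the repo:

```python
# Sketch of pinning the pipeline to a fixed commit so that newer remote code
# (pipeline.py) is not silently downloaded. PINNED_REVISION is a placeholder
# hash, not a real revision of rbawden/modern_french_normalisation.
PINNED_REVISION = "0123abcd"

pipeline_kwargs = dict(
    model="rbawden/modern_french_normalisation",
    revision=PINNED_REVISION,      # pins both remote code and weights
    batch_size=32,
    beam_size=5,
    cache_file="./cache.pickle",
    trust_remote_code=True,
)

# from transformers import pipeline
# normaliser = pipeline(**pipeline_kwargs)  # left commented out: downloads the model
```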
as well as:
```
Some weights of FSMTForConditionalGeneration were not initialized from the model checkpoint at rbawden/modern_french_normalisation and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
```
I don’t recall whether these messages were displayed when the pipeline was working correctly.
Thanks for any help!