NLP: how to handle bad tokenization

Jourhighness · June 12, 2024, 3:28am

Hello, I get nonsense when trying to translate the following German sentence to Swedish using google/madlad400-3b-mt:

a. Natürliche Personen: BundID mit ELSTER-Zertifikat oder nPA/eID/eAT-Authentifizierung
b. Juristische Personen: Unternehmenskonto BUND mit ELSTER-Zertifik

→

. Personen mit Behinderung: BundesID mit ELSTER-Zertifikat oder nPA/eID/eAT-Authentifizierung c. Personen mit Behinderung: BundesID mit ELSTER-Zertifikat oder nPA db. Personen mit Behinderung: BundesID mit ELSTER-Zertifikat oder nPA/e

Code:

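from transformers import pipeline
pipe = pipeline("translation", model="google/madlad400-3b-mt")
# text is the German input above, n_words its word count; "<2sv>" selects Swedish as the target
pipe("<2sv>" + text, max_length=n_words * 5)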

This is likely due to the abundance of abbreviations and special words.

Is there a per-sentence metric I can use to detect bad tokenizations? A naive one would be to calculate the percentage of unknown tokens. In my case, though, the problem seems to be that the model wrongly attends to the abbreviations, rather than being confused by unknown tokens.
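
To make the naive metric concrete, here is a minimal sketch of what I mean. It computes the share of unknown tokens per sentence and, as an extra signal that is my own assumption rather than an established quality measure, the number of subword tokens per whitespace-separated word (abbreviation-heavy text may score high here even when nothing is unknown). The function name tokenization_stats and the default "<unk>" string are made up for illustration; the tokens can come from any tokenizer.

```python
def tokenization_stats(text, tokens, unk_token="<unk>"):
    """Return simple per-sentence tokenization diagnostics.

    text      -- the raw input sentence
    tokens    -- the subword strings a tokenizer produced for `text`
    unk_token -- the tokenizer's unknown-token string (assumed "<unk>" here)
    """
    words = text.split()
    n_tokens = len(tokens)
    n_unk = sum(1 for tok in tokens if tok == unk_token)
    return {
        # Share of tokens the tokenizer could not represent.
        "unk_ratio": n_unk / n_tokens if n_tokens else 0.0,
        # Subword tokens per whitespace word; high values mean heavy splitting.
        "tokens_per_word": n_tokens / len(words) if words else 0.0,
    }


# Possible usage with the model's own tokenizer (downloads the tokenizer files):
#
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("google/madlad400-3b-mt")
#   sentence = "a. Natürliche Personen: BundID mit ELSTER-Zertifikat"
#   stats = tokenization_stats(sentence, tokenizer.tokenize(sentence),
#                              unk_token=tokenizer.unk_token)
```

Neither number says anything about attention, so at best this would flag sentences worth inspecting by hand.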
