MRLs (Morphologically Rich Languages) NLP

Zaid · February 23, 2021, 2:13pm

This space is for discussing morphologically rich languages (MRLs) from a cross-lingual perspective. MRLs are languages whose rich morphology makes it harder to learn useful word or subword representations, so recent advances in NLP tend to lag behind for them.

Zaid · February 23, 2021, 5:29pm

Hey everyone, we could start by giving some examples from each language and explaining why it is considered an MRL. I will start with some examples from Arabic.

Basically, the complexity of Arabic comes from non-concatenative morphology, which occurs when morphemes are not added in a linear fashion. I will give two examples, one of concatenative and one of non-concatenative morphology:

Concatenative morphology: Consider the verb يعلم (he knows), in which the morpheme ي is added to the stem علم to mark the present tense. We can also construct plurals like يعلمون (they know), which adds the prefix ي and the suffix ون to the same stem.
Non-concatenative morphology: This happens mostly when the stem is modified or fitted into a template in a non-linear fashion. For example, broken (irregular) plurals are common in Arabic; they are formed by fitting the stem of the word into a template. For instance, the singular is كاتب (writer), whereas the plural is كتاب (writers).
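
The contrast between the two processes can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the helper names `concatenate` and `interdigitate` are made up here, the root ك ت ب is read off the two example words, and a template is written as a tuple in which an integer picks a root consonant and a string is inserted literally. Short vowels and gemination, which are not written in the unvocalized forms, are ignored.

```python
def concatenate(stem, prefix="", suffix=""):
    """Concatenative morphology: affixes attach linearly around the stem."""
    return prefix + stem + suffix


def interdigitate(root, template):
    """Non-concatenative morphology: root consonants are slotted into a template.

    Integers in the template are indices into the root; strings are literal
    template material (here, the long vowel alif).
    """
    return "".join(root[item] if isinstance(item, int) else item
                   for item in template)


# Concatenative: prefix + stem + suffix.
stem = "علم"
he_knows = concatenate(stem, prefix="ي")
they_know = concatenate(stem, prefix="ي", suffix="ون")

# Non-concatenative: the same root fills two different templates.
root = "كتب"
SINGULAR = (0, "ا", 1, 2)  # C1 + alif + C2 + C3
PLURAL = (0, 1, "ا", 2)    # C1 + C2 + alif + C3
writer = interdigitate(root, SINGULAR)
writers = interdigitate(root, PLURAL)

print(he_knows, they_know, writer, writers)
```

In the first case the stem survives as a contiguous substring, so a segmenter that cuts words into contiguous pieces can in principle recover it. In the second case the root is not a contiguous substring of either form, which is what makes it hard for such a segmenter.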

yosiasz · February 23, 2021, 10:09pm

Amharic is an MRL.

Here is an example with the verb "to know". The verb morphs to the point where it seems like it has no relation to its root.

Verb ማወቅ (maweK - to know)
አውቃለሁ (I know)
ታውቃለህ (you know, male)
ታውቂያለሽ (you know, female)
ያውቃል (he knows)
ታውቃለች (she knows)
ታውቃላችሁ (you know, plural)
እናውቃለን (we know)
ያውቃሉ (they know, also used when referring to an older person)
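
One reason these forms look unrelated to ማወቅ on the surface is that each Ethiopic character writes a consonant together with its vowel, so a vowel change replaces the whole character. The sketch below is an illustration, not an established preprocessing method. It assumes that the basic syllables used in these words sit in rows of eight consecutive Unicode code points per consonant (this holds for the characters here; it is not claimed for the whole Ethiopic block), and the helper name `consonant_skeleton` is hypothetical. It maps every character to the first character of its row and checks whether the consonants of the infinitive reappear in each conjugated form.

```python
def consonant_skeleton(word):
    """Map each Ethiopic syllable to the first character of its 8-code-point row.

    Assumption: clearing the low three bits of the code point lands on the
    base form of the same consonant. Characters outside U+1200..U+135A are
    left unchanged.
    """
    out = []
    for ch in word:
        cp = ord(ch)
        out.append(chr(cp & ~0x7) if 0x1200 <= cp <= 0x135A else ch)
    return "".join(out)


infinitive = "ማወቅ"
forms = ["አውቃለሁ", "ታውቃለህ", "ታውቂያለሽ", "ያውቃል",
         "ታውቃለች", "ታውቃላችሁ", "እናውቃለን", "ያውቃሉ"]

# Characters the infinitive has in common with any conjugated form.
shared_chars = set(infinitive) & set("".join(forms))

# Treating the first character of the infinitive as a prefix is an assumption
# made for this illustration; the remaining consonants are what gets compared.
core = consonant_skeleton(infinitive)[1:]
core_in_every_form = all(core in consonant_skeleton(f) for f in forms)

print("shared surface characters:", shared_chars)
print("consonant core found in every form:", core_in_every_form)
```

A subword vocabulary built on the raw characters cannot see that shared core, because the vowel changes alter the characters themselves.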

writer: ጸሐፊ
writers: ጸሐፊዎች

house: ቤት
houses: ቤቶች (which I guess was at one time written as ቤትዎች and is now contracted to ቤቶች; thanks, forefathers, for messing with NLP)

Aryaman · February 25, 2021, 8:22pm

Hi, neat topic! I'm not sure Hindi or Bengali can be considered MRLs. Their morphology is quite limited (Hindi has three cases and two genders; Bengali is similar but has no gender; verb paradigms are small for both, and the complexity of aspect and most tenses is carried by auxiliary verbs) and solely concatenative, and is thus captured quite well by subword embeddings.

Zaid · February 25, 2021, 8:33pm

Hey @Aryaman, thanks for the reply. I was reading this paper, which considers the morphology of such languages. I guess I had the wrong impression.

Aryaman · February 25, 2021, 8:46pm

Interesting paper @Zaid! I think the problem here with usual subword segmentation methods may be that they are not suited for non-alphabetic scripts, rather than any complex morphology of Hindi and Bengali.
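
One way to see why the script could matter for segmentation is to look at how a Devanagari word is stored. The snippet below is only a sketch using Python's standard `unicodedata` module; the example word हिन्दी ("Hindi") is chosen here for illustration and does not come from the paper. Vowel signs and the virama are separate combining code points, so a segmenter that cuts between arbitrary code points can detach them from their consonants.

```python
import unicodedata

word = "\u0939\u093f\u0928\u094d\u0926\u0940"  # हिन्दी

# Combining marks have a Unicode general category starting with "M";
# the consonant letters themselves are category "Lo".
marks = [ch for ch in word if unicodedata.category(ch).startswith("M")]
letters = [ch for ch in word if unicodedata.category(ch) == "Lo"]

# List every code point with its category and official name.
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

print(len(word), "code points:", len(letters), "letters,", len(marks), "marks")
```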

The Dravidian languages [largely in South India] do have more complex morphology (I believe they are agglutinative) so discussing them here may be worthwhile.

Zaid · February 25, 2021, 9:38pm

Thanks @Aryaman. If you are familiar with linguistics, could we rank languages by morphological complexity?
