MRLs (Morphologically Rich Languages) NLP

This space is for cross-lingual discussion of morphologically rich languages (MRLs). MRLs are languages whose rich morphology makes it hard to learn useful word or subword representations, so recent advances in NLP tend to lag behind for them.


Hey everyone, we could start by giving some examples in each language and explaining why it is considered an MRL. I will start with some examples in Arabic.

Basically, the complexity of Arabic comes from non-concatenative morphology, which happens when morphemes are not combined in a linear fashion. I will give two examples, one of concatenative and one of non-concatenative morphology:

  • Concatenative morphology: Consider the verb يعلم (to know), in which the morpheme ي is added to the stem علم to mark the present tense. We can also construct plurals like يعلمون (they know), which adds the prefix ي and the suffix ون to the same stem.
  • Non-concatenative morphology: happens mostly when the stem is modified or inserted into a template in a non-linear fashion. For example, broken (irregular) plurals are common in Arabic; they are formed by fitting the root of the word into a template. For instance, the word كاتب (writer) has the plural form كتاب (writers).
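
To make the contrast concrete, here is a minimal Python sketch of the two processes. The function names `concatenate` and `interdigitate` and the digit-slot template notation are made up for illustration, and the templates only cover the unvocalized spellings of the examples above; this is not a real analysis of Arabic morphology.

```python
def concatenate(stem, prefix="", suffix=""):
    """Concatenative morphology: affixes attach linearly around the stem."""
    return prefix + stem + suffix


def interdigitate(root, template):
    """Non-concatenative morphology: root consonants fill numbered slots.

    `template` is a sequence of items; "1", "2", "3" stand for the first,
    second and third root consonant, anything else is copied as-is.
    (Hypothetical notation, chosen only for this sketch.)
    """
    return "".join(root[int(t) - 1] if t.isdigit() else t for t in template)


stem = "علم"  # stem from the concatenative example
present = concatenate(stem, prefix="ي")
present_plural = concatenate(stem, prefix="ي", suffix="ون")

root = ["ك", "ت", "ب"]  # consonants shared by the singular and the plural
singular = interdigitate(root, ["1", "ا", "2", "3"])
broken_plural = interdigitate(root, ["1", "2", "ا", "3"])

print(present, present_plural, singular, broken_plural)
```

The point for subword models: the affixed forms still contain the stem as one contiguous substring, while the singular and the broken plural only share the scattered root consonants.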

Amharic is an MRL.

Here is an example for the verb "to know". The verb morphs to the point where it seems to have no relation to its root.

Verb ማወቅ (maweK - to know)
አውቃለሁ (I know)
ታውቃለህ (you know male)
ታውቂያለሽ (you know female)
ያውቃል (he knows)
ታውቃለች (she knows)
ታውቃላችሁ (you know plural)
እናውቃለን (we know)
ያውቃሉ (they know, also used when referring to an older person)
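
A rough Python sketch of why the forms look unrelated: in the Ethiopic script each character encodes a consonant plus a vowel, so a vowel change replaces the whole character. The sketch assumes the regular Unicode layout of 8 code points per consonant row, which holds for the characters used here; `consonant_series` and `skeleton` are names I made up, and treating the leading ማ as prefix material is my assumption.

```python
ETHIOPIC_BASE = 0x1200  # first code point of the Ethiopic block


def consonant_series(ch):
    """Map an Ethiopic syllable to the first-order character of its row.

    Assumes 8 code points per consonant; anything outside the syllable
    range is returned unchanged.
    """
    cp = ord(ch)
    if 0x1200 <= cp <= 0x135A:
        return chr(cp - (cp - ETHIOPIC_BASE) % 8)
    return ch


def skeleton(word):
    """Collapse the vowels so only the consonant rows remain."""
    return "".join(consonant_series(ch) for ch in word)


infinitive = "ማወቅ"
forms = ["አውቃለሁ", "ታውቃለህ", "ታውቂያለሽ", "ያውቃል",
         "ታውቃለች", "ታውቃላችሁ", "እናውቃለን", "ያውቃሉ"]

# Surface characters that each form shares with the infinitive.
shared = [set(infinitive) & set(f) for f in forms]

# Consonant skeleton of the infinitive without its leading ማ.
root = skeleton(infinitive[1:])
has_root = [root in skeleton(f) for f in forms]

print(shared)
print(root, has_root)
```

At the character level the conjugated forms and the infinitive do not overlap, but once the vowels are collapsed the same two root consonants appear in every form.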

writer: ጸሐፊ
writers: ጸሐፊዎች

house: ቤት
houses: ቤቶች (which I guess was at one time written as ቤትዎች, now contracted to ቤቶች; thanks, forefathers, for messing with NLP)
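
For what it's worth, the two plurals above can be reproduced with a toy rule in Python. This is only a sketch that fits these two nouns, not a description of Amharic pluralization in general; `pluralize` is a hypothetical helper, and it assumes the Unicode layout of 8 code points per consonant row, with the sixth order (bare consonant) at offset 5 and the seventh order (vowel o) at offset 6.

```python
ETHIOPIC_BASE = 0x1200           # first code point of the Ethiopic block
SUFFIX_FULL = "\u12ce\u127d"     # ዎች, used when the noun ends in a vowel
SUFFIX_SHORT = "\u127d"          # ች, left over after the contraction


def pluralize(noun):
    """Toy rule covering only the two nouns in this post."""
    last = ord(noun[-1])
    order = (last - ETHIOPIC_BASE) % 8
    if order == 5:
        # Final bare consonant: switch it to its o-form and add ች,
        # e.g. ት becomes ቶ.
        return noun[:-1] + chr(last + 1) + SUFFIX_SHORT
    # Otherwise keep the noun intact and append the full suffix.
    return noun + SUFFIX_FULL


house = "\u1264\u1275"           # ቤት
writer = "\u1338\u1210\u134a"    # ጸሐፊ
print(pluralize(house), pluralize(writer))
```

So the suffix is not a fixed string at the character level: in the contracted case part of it ends up inside the last character of the stem.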

Hi, neat topic! Not sure if Hindi or Bengali can be considered MRLs. Their morphology is quite limited (Hindi has three cases and two genders; Bengali is similar but has no gender; verb paradigms are small for both, and the complexity of aspect and most tenses is carried by auxiliary verbs) and solely concatenative, and is thus captured quite well by subword embeddings.

Hey @Aryaman, thanks for the reply. I was reading this paper, which considers the morphology of such languages. I guess I had the wrong impression :confused:.

Interesting paper @Zaid! I think the problem here with usual subword segmentation methods may be that they are not suited for non-alphabetic scripts, rather than any complex morphology of Hindi and Bengali.
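
One way this could show up (my assumption, not something taken from the paper): in Devanagari, vowel signs are separate code points attached to a consonant, so a segmenter that works on raw code points can cut a syllable in half. A minimal Python sketch using the standard `unicodedata` module; `clusters` is a hypothetical helper and ignores conjuncts.

```python
import unicodedata


def clusters(word):
    """Group each base character with the combining marks that follow it.

    A simplification: real orthographic syllables also involve the virama
    and conjunct consonants, which this sketch does not handle.
    """
    out = []
    for ch in word:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch  # attach the vowel sign to the preceding consonant
        else:
            out.append(ch)
    return out


word = "\u0915\u093f\u0924\u093e\u092c"  # Hindi किताब (book)
print(list(word))      # raw code points: consonants and vowel signs apart
print(clusters(word))  # consonant + vowel sign kept together
```

The word is five code points but only three written units, so any merge or split decided at the code-point level may not line up with what a reader sees as a character.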

The Dravidian languages [largely in South India] do have more complex morphology (I believe they are agglutinative) so discussing them here may be worthwhile.

Thanks @Aryaman. If you are familiar with linguistics, can we sort languages by morphological complexity?