This space is for cross-lingual discussion of morphologically rich languages (MRLs). MRLs are languages whose rich morphology makes it hard for recent NLP methods to learn useful word or subword representations, so progress on them lags behind.
Hey everyone, we could start by mentioning some examples in each language and why it is considered an MRL. I will start with some examples in Arabic.
Basically, the complexity of Arabic comes from non-concatenative morphology, which happens when morphemes are not added in a linear fashion. I will give two examples, one of concatenative and one of non-concatenative morphology:
- Concatenative morphology: Consider the verb يعلم (to know), in which the morpheme ي is added to the stem علم to mark the present tense. We can also build plural forms like يعلمون (they know), which adds the prefix ي and the suffix ون to the same stem.
- Non-concatenative morphology: happens mostly when the stem is modified or slotted into a template in a non-linear fashion. For example, broken (irregular) plurals are very common in Arabic; they place the root of the word into a template to create the plural. For instance, the singular كاتب (writer) has the plural كتاب (writers).
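To make the contrast between the two bullets concrete, here is a minimal sketch. It assumes rough Latin transliterations of the Arabic forms (they are approximations, not from the posts above), and the helper names `concatenate` and `apply_template` are hypothetical, made up for illustration only.

```python
def concatenate(stem, prefix="", suffix=""):
    """Concatenative morphology: affixes attach linearly around an unchanged stem."""
    return prefix + stem + suffix


def apply_template(root, template):
    """Non-concatenative morphology: the digits 1-3 in the template are slots
    for the root consonants; every other character is copied as-is."""
    return "".join(
        root[int(ch) - 1] if ch in "123" else ch
        for ch in template
    )


# Rough Latin transliterations (an assumption of this sketch).
know = concatenate("ʿlam", prefix="ya")                    # "he knows"
they_know = concatenate("ʿlam", prefix="ya", suffix="ūn")  # "they know"

# The same three-consonant root k-t-b is interleaved with two different patterns.
writer = apply_template("ktb", "1ā2i3")    # singular pattern
writers = apply_template("ktb", "1u22ā3")  # broken-plural pattern

print(know, they_know, writer, writers)
```

In the concatenative case the stem survives as a contiguous substring, which is what subword methods tend to exploit. In the templatic case the root consonants are spread across the word, so no contiguous piece is shared between singular and plural beyond single letters.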
Amharic is an MRL.
Here is an example with the verb "to know". The verb morphs to the point where it seems to have no relation to its root.
Verb ማወቅ (maweK - to know)
አውቃለሁ (I know)
ታውቃለህ (you know male)
ታውቂያለሽ (you know female)
ያውቃል (he knows)
ታውቃለች (she knows)
ታውቃላችሁ (you know plural)
እናውቃለን (we know)
ያውቃሉ (they know, also used when referring to an older person)
writer: ጸሐፊ
writers: ጸሐፊዎች
house: ቤት
houses: ቤቶች (which I guess was at one time written as ቤትዎች, now contracted to ቤቶች; thanks, forefathers, for messing with NLP)
Hi, neat topic! Not sure if Hindi or Bengali can be considered MRLs. Their morphology is quite limited (Hindi has three cases and two genders; Bengali is similar but has no gender; verb paradigms are small for both, and the complexity of aspect and most tenses is carried by auxiliary verbs) and solely concatenative, and is thus captured quite well by subword embeddings.
Hey @Aryaman, thanks for the reply. I was reading this paper, which considers the morphology of such langs. I guess I had the wrong impression.
Interesting paper @Zaid! I think the problem here with usual subword segmentation methods may be that they are not suited for non-alphabetic scripts, rather than any complex morphology of Hindi and Bengali.
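As a rough illustration of the script issue (not a claim about any particular tokenizer), here is a minimal sketch of how a Devanagari word looks when split into raw code points versus base-plus-mark clusters. The helper name `clusters` is hypothetical, and the grouping rule is a simplification that ignores conjuncts and full grapheme-cluster rules.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it.
    A simplification: real grapheme segmentation (and conjuncts) needs more rules."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch  # vowel signs and nasalization marks attach to the previous base
        else:
            out.append(ch)
    return out


word = "\u0939\u093f\u0902\u0926\u0940"  # the word "Hindi" written in Devanagari
code_points = list(word)  # naive character-level split
units = clusters(word)    # base consonant plus its dependent marks

print(len(code_points), len(units))
```

A segmenter that treats every code point as an independent symbol can cut between a consonant and its dependent vowel sign, which has no direct analogue in alphabetic scripts.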
The Dravidian languages [largely in South India] do have more complex morphology (I believe they are agglutinative) so discussing them here may be worthwhile.
Thanks @Aryaman. If you are familiar with linguistics, can we sort languages by morphological complexity?