This space is for discussing morphologically rich languages (MRLs) from a cross-lingual perspective. MRLs are languages whose rich morphology makes it hard to learn useful word or subword representations, so recent advances in NLP tend to lag behind for them.
Hey everyone, we could start by giving some examples from each language and explaining why it is considered an MRL. I will start with some examples in Arabic.
Basically, the complexity of Arabic comes from non-concatenative morphology, which happens when morphemes are not added in a linear fashion. I will give two examples, one of concatenative and one of non-concatenative morphology:
- Concatenative morphology: Consider the verb يعلم (he knows), in which the morpheme ي is added to the stem علم to mark the present tense. We can also construct plurals like يعلمون (they know), which adds the prefix ي and the suffix ون to the same stem.
- Non-concatenative morphology: happens mostly when the stem is modified or fitted into a template in a non-linear fashion. For example, broken (irregular) plurals, which are very common in Arabic, are formed by fitting the stem of the word into a template. For instance, the word كاتب (writer) has the plural form كتاب (writers).
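To make the contrast concrete, here is a rough Python sketch of the two processes using the words above. The function names and the template encoding (integers for root-consonant slots, strings for fixed letters) are my own assumptions for illustration, not a standard analyzer, and the forms are unvocalized, so short vowels and gemination are not modelled.

```python
def concatenate(stem, prefix="", suffix=""):
    """Concatenative morphology: affixes attach linearly around an intact stem."""
    return prefix + stem + suffix


def interdigitate(root, template):
    """Non-concatenative morphology: root consonants are slotted into a template.

    `template` is a tuple whose integer entries index into `root` and whose
    string entries are fixed template letters (hypothetical encoding).
    """
    return "".join(root[slot] if isinstance(slot, int) else slot
                   for slot in template)


# Templates for the two forms mentioned above (illustrative only).
SINGULAR_TEMPLATE = (0, "ا", 1, 2)  # C1 + alif + C2 + C3
PLURAL_TEMPLATE = (0, 1, "ا", 2)    # C1 + C2 + alif + C3

they_know = concatenate("علم", prefix="ي", suffix="ون")
writer = interdigitate("كتب", SINGULAR_TEMPLATE)
writers = interdigitate("كتب", PLURAL_TEMPLATE)
print(they_know, writer, writers)
```

In the concatenative case the stem survives as a contiguous substring, which is what subword segmentation can latch onto; in the templatic case the root letters are interrupted by template material, so no contiguous piece corresponds to the root.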
Amharic is an MRL.
Here is an example for the verb "to know". The verb morphs to the point where it seems to have no relation to its root.
Verb ማወቅ (maweK - to know):
- አውቃለሁ (I know)
- ታውቃለህ (you know male)
- ታውቂያለሽ (you know female)
- ያውቃል (he knows)
- ታውቃለች (she knows)
- ታውቃላችሁ (you know plural)
- እናውቃለን (we know)
- ያውቃሉ (they know, also used when referring to an older person)

Nouns:
- writer: ጸሐፊ, writers: ጸሐፊዎች
- house: ቤት, houses: ቤቶች (which I guess was at one time written as ቤትዎች, now contracted to ቤቶች; thanks forefathers for messing with NLP)
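The plural contraction above can be sketched in a few lines of Python. This is a toy rule that I have only checked against the two nouns in this post, not a real Amharic pluralizer: it assumes the last letter sits in a standard Ethiopic Unicode row of eight code points in which the low three bits give the vowel order, and it ignores irregular plurals and sixth-order letters that are actually pronounced with a vowel. The function name `pluralize` is hypothetical.

```python
SIXTH_ORDER = 5    # offset of the bare-consonant form within a Unicode row
SEVENTH_ORDER = 6  # offset of the "o" form within the same row


def pluralize(noun):
    """Toy Amharic plural: fuse the suffix vowel into a final bare consonant."""
    last = noun[-1]
    order = ord(last) & 7  # assumption: rows are aligned on multiples of 8
    if order == SIXTH_ORDER:
        # The final consonant absorbs the "o" of the suffix: ት becomes ቶ.
        fused = chr(ord(last) - SIXTH_ORDER + SEVENTH_ORDER)
        return noun[:-1] + fused + "ች"
    # Otherwise the suffix is simply appended.
    return noun + "ዎች"


for noun in ["ቤት", "ጸሐፊ"]:
    print(noun, pluralize(noun))
```

The point for NLP is that in the contracted form the boundary between stem and suffix falls inside a single character (ቶ), so a character-level or subword segmenter cannot cut cleanly between the noun and its plural marker.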
Hi, neat topic! Not sure if Hindi or Bengali can be considered MRLs. Their morphology is quite limited (Hindi has three cases and two genders; Bengali is similar but has no gender; verb paradigms are small for both, and the complexity of aspect and most tenses is carried by auxiliary verbs) and solely concatenative, and is thus captured quite well by subword embeddings.
Interesting paper @Zaid! I think the problem here with usual subword segmentation methods may be that they are not suited for non-alphabetic scripts, rather than any complex morphology of Hindi and Bengali.
The Dravidian languages [largely in South India] do have more complex morphology (I believe they are agglutinative) so discussing them here may be worthwhile.
Thanks @Aryaman. If you are familiar with linguistics, can we sort languages by morphological complexity?