I’m considering pretraining a model on an ancient Greek corpus. So far, I have not been able to find any transformer model for this language, so I would like to know whether somebody is already working on one before I start.
Second, I would like to know whether it makes sense to pretrain an ancient Greek (grc) model at all, given that ancient Greek corpora are very small compared to modern-language corpora. The largest available ancient Greek corpus is about 500 MB, there are no ancient Greek Wikipedia articles, and Google Books scans are unreliable because of the many diacritics used in grc.
Is there any particular way to approach languages with small datasets? I’m also interested in other ancient languages with small surviving corpora, thinking that transformers could be used for text restoration (filling gaps in damaged texts).
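For what it’s worth, the gap-filling idea maps naturally onto the masked-language-modeling objective used in BERT-style pretraining: hide some tokens and train the model to recover them, which is essentially the restoration task. Here is a minimal plain-Python sketch of the standard masking step (simplified: real BERT also replaces some selected tokens with random ones or leaves them unchanged; the example sentence and all names are purely illustrative):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style masking: hide roughly `mask_prob` of the tokens and keep
    the originals as labels the model must predict."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # model is scored on predicting this token
        else:
            masked.append(tok)
            labels.append(None)  # position not scored
    return masked, labels

# Illustrative example on the opening of the Iliad:
tokens = "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(42))
```

At inference time the same mechanism runs in reverse: you put `[MASK]` where the papyrus is broken and ask the trained model for its top predictions at that position.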