ALBERTI, a multilingual model for poetry tasks


1. Description

The idea is to further train multilingual BERT on a multilingual corpus of poetry split into stanzas and verses. The goal is to outperform monolingual models on tasks such as metrical pattern prediction, rhyme and rhythm identification, rhetorical figure detection, and stanza type classification.

2. Language

This will be a multilingual model using Spanish, Italian, French, German, Czech, Hungarian, and English.

3. Model

We can use mBERT or XLM-R as the backbone model. Given the nature of the downstream tasks, it would be ideal to have ALBERT and its sentence-ordering objective.

4. Datasets

We might need to convert corpora from Averell into datasets, which amounts to 25M words. There is also a Spanish corpus that contains 10M words, a German corpus with 4M words, and a Portuguese one with 1M words. In total, that would be around 40M words.
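The arithmetic above checks out; as a quick sanity check (the corpus labels here are just shorthand for the sources listed):

```python
# Word counts per corpus, as listed above
corpus_words = {
    "Averell (multilingual)": 25_000_000,
    "Spanish": 10_000_000,
    "German": 4_000_000,
    "Portuguese": 1_000_000,
}

total = sum(corpus_words.values())
print(f"{total / 1_000_000:.0f}M words")  # → 40M words
```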

5. Training scripts

There are already Flax scripts to pre-train BERT and RoBERTa that we can easily reuse: the transformers/examples/flax/language-modeling folder of the huggingface/transformers repository on GitHub.
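Assuming the run_mlm_flax.py script from that folder, an invocation might look roughly like this (the data path and all hyperparameters below are placeholders, not tested values):

```shell
python run_mlm_flax.py \
    --model_name_or_path="bert-base-multilingual-cased" \
    --train_file="poetry_corpus.txt" \
    --max_seq_length=128 \
    --per_device_train_batch_size=32 \
    --learning_rate=2e-5 \
    --num_train_epochs=10 \
    --output_dir="./alberti"
```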

6. Challenges

There is little data and the sequences are very short. Hopefully, starting from pre-existing weights will mitigate both issues. The fact that sequence lengths are generally short should also make training faster, as a single phase at sequence length 128 (or even 64) might be enough.
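One way to exploit the short sequences is to pack several verses into each fixed-length training example, similar to the text-grouping step in the existing pre-training scripts. A minimal sketch of the idea, using whitespace splitting as a stand-in for a real subword tokenizer (the function name and separator are illustrative):

```python
def pack_verses(verses, max_len=128, sep="[SEP]"):
    """Greedily pack short verses into token blocks of at most max_len tokens."""
    blocks, current = [], []
    for verse in verses:
        tokens = verse.split()  # stand-in for a real subword tokenizer
        # +1 accounts for the separator token between verses
        if current and len(current) + 1 + len(tokens) > max_len:
            blocks.append(current)
            current = []
        if current:
            current.append(sep)
        current.extend(tokens)
    if current:
        blocks.append(current)
    return blocks

# Toy example: 80 short verses packed into 64-token blocks
verses = ["En una noche oscura", "con ansias en amores inflamada"] * 40
blocks = pack_verses(verses, max_len=64)
```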

7. Desired project outcome

A well-performing multilingual model on the poetry downstream tasks.

8. Reads


This is really interesting to me. Not sure how to solve the size issue or if it's even possible, but definitely worth exploring.

Great idea! I would love to be involved with this!

Looking forward to the results you will get!
I am working on poetry similarity and your model would be very useful for my team!
Good luck!

Looks cool - think we can define it 🙂

The first thing to test might be how we can continue training with JAX/Flax starting from PyTorch weights.