Scandinavian RoBERTa

Also, it turns out we will get access to our TPU server tomorrow! I’ll receive an email about it and will post all the relevant information here. Here is the HuggingFace guide on how to access the server.

There were three folks from Google Brain giving talks yesterday about this community week. They are:

  • Skye Wanderman-Milne: Intro to JAX on Cloud TPUs
  • Marc van Zee: Introduction to Flax
  • Pablo Castro: Using Jax & Flax for RL with the Dopamine library

A recording of the talks can be found here.

Alright, regarding our meeting: most of you were available on Sunday morning/midday, and a lot of you are also available tonight. So let’s just meet at both of those times!

@Juunge is hosting a Google Meet chat in two hours, at 8pm CEST, and we’ll organise a time for Sunday as well. Speak to y’all soon!

I’ve voted on huggingface! Whenever the meeting ends up being, I’ve set up a Google Meet where I’ll be tonight at 20:00, if anyone has time to chat about the project :slight_smile:

Link: https://meet.google.com/hni-hfdn-dof

Hi all!
We’re thinking of moving the planning chat to Slack, as it makes things a bit easier. Could you all send me your email address, either as a reply or a PM here? @Gabriel @versae @pere

Per@capia.no

Why would a multilingual model perform better than a model trained on a single language? Moreover, it seems to me that Icelandic is quite different from the other three.

Hey Abhishek! Since Danish, Swedish and Norwegian are relatively low-resource languages, the idea is to increase the amount of pretraining data by combining the datasets available in the respective languages, and thus hopefully create a better model :slight_smile: Regarding Icelandic, you’re right, it may be too different to include.
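Just to make that concrete, here is a minimal sketch of the combination step with the datasets library. The OSCAR subsets and the interleaving strategy are only placeholders for illustration, not a fixed choice for the project:

```python
from datasets import load_dataset, interleave_datasets

# Illustrative corpora only -- any Danish/Swedish/Norwegian datasets on the Hub would do.
da = load_dataset("oscar", "unshuffled_deduplicated_da", split="train")
sv = load_dataset("oscar", "unshuffled_deduplicated_sv", split="train")
no = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")

# Interleave so every batch mixes the three languages instead of seeing them in blocks.
scandi = interleave_datasets([da, sv, no])
print(scandi)
```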

Sure. But if I understand correctly, the nuances are still different; there are different ways of writing things. For example, insurance in Norwegian is forsikring, but in Swedish it is försäkring. So combining data from these languages could result in generated words that mix the languages. If I’m working with Norwegian, I wouldn’t want it to generate Swedish words, right? :slight_smile:

I do understand that these are low-resource languages. What about combining the datasets you mentioned with library (bibliotek) data, i.e. newspapers and articles from the respective countries? The last time I checked, a huge amount of text data was available from the national library website in Norway. WDYT?

@pere has actually pretrained a Norwegian BERT model on that exact dataset :grin: NbAiLab/nb-bert-large · Hugging Face

Regarding the generation of words with a combined Scandinavian pretrained model, I think it is going to learn to distinguish between the languages. However, I have no experience pretraining transformers, so I’m not sure :sweat_smile:

Are you participating in the community week with a cool Flax/JAX project @abhishek? :grin:

Coming in late. I would like to join, as this is much needed.

As @pere pointed out to me in a mail, byte-pair encoding still has the left-to-right encoding of WordPiece. I was thinking about byte-level encoding instead, e.g. ByT5. Maybe a learned tokenizer like Charformer would also be able to recognise that “forsikring” and “försäkring” share a lot of properties even though they are in different languages?
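To make the byte-level idea concrete, here is a rough sketch using ByT5’s tokenizer from transformers (the google/byt5-small checkpoint is just an example). Since it operates on raw UTF-8 bytes rather than a learned vocabulary, the two spellings end up sharing most of their tokens:

```python
from transformers import AutoTokenizer

# ByT5 tokenizes raw UTF-8 bytes, so no per-language vocabulary is learned.
tok = AutoTokenizer.from_pretrained("google/byt5-small")

no_ids = tok("forsikring", add_special_tokens=False).input_ids
sv_ids = tok("försäkring", add_special_tokens=False).input_ids

# The Norwegian and Swedish spellings only differ in the bytes for o/ö and i/ä,
# so most byte tokens are shared between the two words.
shared = set(no_ids) & set(sv_ids)
print(no_ids)
print(sv_ids)
print(f"{len(shared)} byte tokens shared out of {len(set(no_ids) | set(sv_ids))}")
```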