Scandinavian RoBERTa

Also, it turns out we will get access to our TPU server tomorrow! I’ll receive an email about it and will post all the relevant information here. Here is the HuggingFace guide on how to access the server.

There were three folks from Google Brain giving talks yesterday about this community week. They are:

  • Skye Wanderman-Milne: Intro to JAX on Cloud TPUs
  • Marc van Zee: Introduction to Flax
  • Pablo Castro: Using Jax & Flax for RL with the Dopamine library

A recording of the talks can be found here.

Alright, regarding our meeting: most of you were available on Sunday morning/midday, and a lot of you are also available tonight. So let’s just meet at both of those times!

@Juunge is hosting a Google Meet chat in two hours, at 8pm CEST, and we’ll organise a time for Sunday as well. Speak to y’all soon!

I’ve voted on huggingface! Whenever the meeting ends up being, I’ve set up a Google Meet where I’ll be tonight at 20:00, if anyone has time to chat about the project :slight_smile:

Link: https://meet.google.com/hni-hfdn-dof

Hi all!
We’re thinking of moving the planning chat to Slack, as it makes things a bit easier. Could you all send me your email address, either as a reply or a PM here? @Gabriel @versae @pere

Per@capia.no

Why would a multilingual model perform better than a model trained on a single language? Moreover, it seems to me that Icelandic is quite different from the other three.

Hey Abhishek! Since Danish, Swedish and Norwegian are relatively low-resource languages, the idea is to increase the amount of pretraining data by combining the datasets available in the respective languages, and thus hopefully create a better model :slight_smile: Regarding Icelandic, you’re right, it may be too different to include.
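Just to make that concrete, here is a minimal sketch of the combination step with the datasets library. The OSCAR subsets and the interleaving strategy are only placeholders for illustration, not a fixed choice for the project:

```python
from datasets import load_dataset, interleave_datasets

# Illustrative corpora only -- any Danish/Swedish/Norwegian datasets on the Hub would do.
da = load_dataset("oscar", "unshuffled_deduplicated_da", split="train")
sv = load_dataset("oscar", "unshuffled_deduplicated_sv", split="train")
no = load_dataset("oscar", "unshuffled_deduplicated_no", split="train")

# Interleave so every batch mixes the three languages instead of seeing them in blocks.
scandi = interleave_datasets([da, sv, no])
print(scandi)
```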

Sure. But if I understand correctly, the nuances are still different; there are different ways of writing things. For example, insurance in Norwegian is forsikring, but in Swedish it is försäkring. So combining data from these languages could result in generated words that mix the languages. If I’m working with Norwegian, I wouldn’t want it to generate Swedish words, right? :slight_smile:

I do understand that these are low-resource languages. What about combining the datasets you mentioned with library (bibliotek) data, i.e. newspapers and articles from the respective countries? The last time I checked, a huge amount of text data was available from the national library website in Norway. WDYT?

@pere has actually pretrained a Norwegian BERT model on that exact dataset :grin: NbAiLab/nb-bert-large · Hugging Face

Regarding the generation of words with a combined Scandinavian pretrained model, I think it is going to learn to distinguish between the languages. However, I have no experience pretraining transformers, so I’m not sure :sweat_smile:

Are you participating in the community week with a cool Flax/JAX project @abhishek? :grin:

Coming in late. I would like to join, as this is much needed.

As @pere pointed out to me in a mail, byte-pair encoding still has the left-to-right encoding of WordPiece. I was thinking about byte-level encoding instead, e.g. ByT5. Maybe a learned tokenizer like Charformer would also be able to recognise that “forsikring” and “försäkring” share a lot of properties even though they are in different languages?
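To make the byte-level idea concrete, here is a rough sketch using ByT5’s tokenizer from transformers (the google/byt5-small checkpoint is just an example). Since it operates on raw UTF-8 bytes rather than a learned vocabulary, the two spellings end up sharing most of their tokens:

```python
from transformers import AutoTokenizer

# ByT5 tokenizes raw UTF-8 bytes, so no per-language vocabulary is learned.
tok = AutoTokenizer.from_pretrained("google/byt5-small")

no_ids = tok("forsikring", add_special_tokens=False).input_ids
sv_ids = tok("försäkring", add_special_tokens=False).input_ids

# The Norwegian and Swedish spellings only differ in the bytes for o/ö and i/ä,
# so most byte tokens are shared between the two words.
shared = set(no_ids) & set(sv_ids)
print(no_ids)
print(sv_ids)
print(f"{len(shared)} byte tokens shared out of {len(set(no_ids) | set(sv_ids))}")
```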