[Open-to-the-community] One week team-effort to reach v2.0 of HF datasets library

I'd love to contribute to this effort!

Such a cool initiative. I would love to contribute as well, please.

@thomwolf
Great initiative, I would love to contribute.

| Dataset | Paper | Link |
|---|---|---|
| PAWS-X | PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification | https://arxiv.org/abs/1908.11828 |
| French classic texts | | French Wikisource |
| Danish datasets | | https://github.com/fnielsen/awesome-danish |

and other French or Danish datasets

I'd love to contribute!
Would something like parallel corpora for machine translation be accepted as datasets?

@thomwolf I'd love to join the effort! Looking forward to it.

Count me in please!

Count me in!

Thanks a lot for wanting to participate, this is really amazing! Toward the end of the week, I'll add you all to our Slack channel using the email address associated with your account. If you don't receive an invitation or don't want to use that address, just tell me!

I would be happy to contribute too.
I'll compile my list of datasets and see if I've got anything that isn't already there.

I would like to participate. Thanks!

I'm keen to contribute!

Thank you for the initiative. It would be nice to add OrangeSum, a French abstractive summarization dataset, and I would be happy to contribute.

Count me in @thomwolf - Swedish :sweden:

Thank you very much for this opportunity.
I would love to be part of this and to contribute the Fon language. Here's a corpus of Fon-French pairs that Bonaventure Dossou and I worked on creating: https://github.com/bonaventuredossou/ffr-v1/tree/master/FFR-Dataset

Count me in as well. I can help with datasets for Indian languages.

@thomwolf great initiative! I would love to contribute. I can help out with biomedical/clinical datasets.

@thomwolf I would love to help!

I am in, @thomwolf!

Hi @thomwolf. I would love to help with Thai. Here are the datasets:

| Dataset | Task | Link/GitHub | Remarks |
|---|---|---|---|
| scb-mt-en-th-2020 | machine translation (en-th) | airesearch.in.th/releases/machine-translation-datasets/ | Paper on arXiv: /abs/2007.03541 |
| prachathai-67k | sequence classification | PyThaiNLP/prachathai-67k | |
| wisesight-sentiment | sequence classification | PyThaiNLP/wisesight-sentiment | |
| wongnai-corpus | sequence classification | wongnai/wongnai-corpus | |
| TR-TPBS | summarization | nakhunchumpolsathien/TR-TPBS | |
| Thai QA | question answering | https://aiforthai.in.th/corpus.php | |
| LST20 | token classification (NER) | https://aiforthai.in.th/corpus.php | |

Sorry about the links; as a new user I can only post two of them.

@thomwolf I would like to participate. We recently added the indic_glue benchmark for evaluating models on Indian languages, and I would like to add more datasets to the library.
