I'd love to contribute to this effort!
Such a cool initiative. I would love to contribute as well please.
@thomwolf
Great initiative, I would love to contribute.
dataset | paper | link |
---|---|---|
PAWS-X | PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification | https://arxiv.org/abs/1908.11828 |
French classic texts | French Wikisource | |
Danish datasets | | https://github.com/fnielsen/awesome-danish |
and other French or Danish datasets
I'd love to contribute!
Is something like parallel corpora for translation accepted for these datasets?
@thomwolf I'd love to join the efforts! Looking forward to it.
Count me in please!
Count me in!
Thanks a lot for wanting to participate, this is really amazing! Toward the end of the week, I'll add you all to our Slack channel using the email address associated with your account. If you don't receive an invitation or don't want to use this address, just tell me!
I would be happy to contribute too.
I'll compile my datasets list and see if I've got anything not already there.
I would like to participate. thanks!
I'm keen to help contribute!
Thank you for the initiative. It would be nice to add OrangeSum, a French abstractive summarization dataset. I would be happy to contribute as well.
Dataset | Paper | Link |
---|---|---|
OrangeSum | BARThez: a Skilled Pretrained French Sequence-to-Sequence Model | https://github.com/Tixierae/OrangeSum |
Count me in @thomwolf - Swedish
Thank you very much for this opportunity.
Will love to be part of this.
Would love to contribute for the Fon language. Here's a corpus of Fon-French pairs that Bonaventure Dossou and I worked on creating: https://github.com/bonaventuredossou/ffr-v1/tree/master/FFR-Dataset
Count me in as well. I can help with getting datasets for Indian languages.
@thomwolf great initiative! I would love to contribute. I can help out with biomedical/clinical datasets.
@thomwolf I would love to help!
I am in @thomwolf
Hi @thomwolf . I would love to help for Thai. Here are the datasets:
Datasets | Tasks | Links/Github | Remarks |
---|---|---|---|
scb-mt-en-th-2020 | machine translation (en-th) | airesearch.in.th/releases/machine-translation-datasets/ | Paper in arxiv: /abs/2007.03541 |
prachathai-67k | sequence classification | PyThaiNLP/prachathai-67k | |
wisesight sentiment | sequence classification | PyThaiNLP/wisesight-sentiment | |
wongnai-corpus | sequence classification | wongnai/wongnai-corpus | |
TR-TPBS | summarization | nakhunchumpolsathien/TR-TPBS | |
Thai QA | question answering | https://aiforthai.in.th/corpus.php | |
LST20 | token classification (NER) | https://aiforthai.in.th/corpus.php | |
Sorry for the links. I can only post 2 links as I am a new user.
@thomwolf I would like to participate. We recently added the indic_glue benchmark for evaluating models on Indian languages. I would like to add more datasets to the library.