[Open-to-the-community] One week team-effort to reach v2.0 of HF datasets library

I’d love to contribute to this effort!

Such a cool initiative. I would love to contribute as well please.

@thomwolf
Great initiative, I would love to contribute.

| Dataset | Paper / Link |
| --- | --- |
| PAWS-X | PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification, https://arxiv.org/abs/1908.11828 |
| French classic texts | French Wikisource |
| Danish datasets | https://github.com/fnielsen/awesome-danish |

and other French or Danish datasets

I’d love to contribute!
Would parallel corpora for translation be accepted as datasets?

@thomwolf I’d love to join the efforts! Looking forward to it

Count me in please!

Count me in!

Thanks a lot for wanting to participate, this is really amazing! Toward the end of the week, I'll add you all to our Slack channel using the email address associated with your account. If you don't receive an invitation or would rather use a different address, just let me know!

I would be happy to contribute too.
I’ll compile my datasets list and see if I’ve got anything not already there.

I would like to participate. Thanks!

I’m keen to help contribute!

Thank you for the initiative. It would be nice to add OrangeSum, a French abstractive summarization dataset. I would be happy to contribute.

Count me in @thomwolf - Swedish :sweden:

Thank you very much for this opportunity.
I would love to be part of this and to contribute to the Fon language. Here's a corpus of Fon-French pairs that Bonaventure Dossou and I worked on creating: https://github.com/bonaventuredossou/ffr-v1/tree/master/FFR-Dataset

Count me in as well. I can help with getting datasets for Indian languages.

@thomwolf great initiative! I would love to contribute. I can help out with biomedical/clinical datasets.

@thomwolf I would love to help!

I am in @thomwolf

Hi @thomwolf . I would love to help for Thai. Here are the datasets:

| Dataset | Task | Link / GitHub | Remarks |
| --- | --- | --- | --- |
| scb-mt-en-th-2020 | machine translation (en-th) | airesearch.in.th/releases/machine-translation-datasets/ | Paper on arXiv: /abs/2007.03541 |
| prachathai-67k | sequence classification | PyThaiNLP/prachathai-67k | |
| wisesight sentiment | sequence classification | PyThaiNLP/wisesight-sentiment | |
| wongnai-corpus | sequence classification | wongnai/wongnai-corpus | |
| TR-TPBS | summarization | nakhunchumpolsathien/TR-TPBS | |
| Thai QA | question answering | https://aiforthai.in.th/corpus.php | |
| LST20 | token classification (NER) | https://aiforthai.in.th/corpus.php | |

Sorry for the links. I can only post 2 links as I am a new user.

@thomwolf I would like to participate. We recently added the indic_glue benchmark for evaluating models on Indian languages, and I would like to add more datasets to the library.
