[Open-to-the-community] One week team-effort to reach v2.0 of HF datasets library

I’d love to contribute to this effort!

Such a cool initiative. I would love to contribute as well please.

@thomwolf
Great initiative, I would love to contribute.

| Dataset | Paper / Link |
| --- | --- |
| PAWS-X | PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification, https://arxiv.org/abs/1908.11828 |
| French classic texts | French Wikisource |
| Danish datasets | https://github.com/fnielsen/awesome-danish |

and other French or Danish datasets

I’d love to contribute!
Would parallel corpora for translation be accepted as datasets?

@thomwolf I’d love to join the efforts! Looking forward to it

Count me in please!

Count me in!

Thanks a lot for wanting to participate, this is really amazing! Toward the end of the week, I'll add you all to our Slack channel using the email address associated with your account. If you don't receive an invitation or would rather use a different address, just let me know!

I would be happy to contribute too.
I’ll compile my datasets list and see if I’ve got anything not already there.

I would like to participate. Thanks!

I’m keen to help contribute!

Thank you for the initiative. It would be nice to add OrangeSum, a French abstractive summarization dataset. I would be happy to contribute.

Count me in @thomwolf - Swedish :sweden:

Thank you very much for this opportunity.
I would love to be part of this and to contribute to the Fon language. Here's a corpus of Fon-French pairs that Bonaventure Dossou and I worked on creating: https://github.com/bonaventuredossou/ffr-v1/tree/master/FFR-Dataset

Count me in as well. I can help with getting datasets for Indian languages.

@thomwolf great initiative! I would love to contribute. I can help out with biomedical/clinical datasets.

@thomwolf I would love to help!

I am in @thomwolf

Hi @thomwolf . I would love to help for Thai. Here are the datasets:

| Dataset | Task | Link / GitHub | Remarks |
| --- | --- | --- | --- |
| scb-mt-en-th-2020 | machine translation (en-th) | airesearch.in.th/releases/machine-translation-datasets/ | Paper on arXiv: /abs/2007.03541 |
| prachathai-67k | sequence classification | PyThaiNLP/prachathai-67k | |
| wisesight sentiment | sequence classification | PyThaiNLP/wisesight-sentiment | |
| wongnai-corpus | sequence classification | wongnai/wongnai-corpus | |
| TR-TPBS | summarization | nakhunchumpolsathien/TR-TPBS | |
| Thai QA | question answering | https://aiforthai.in.th/corpus.php | |
| LST20 | token classification (NER) | https://aiforthai.in.th/corpus.php | |

Sorry for the links. I can only post 2 links as I am a new user.

@thomwolf I would like to participate. We recently added the indic_glue benchmark for evaluating models on Indian languages, and I would like to add more datasets to the library.
