[Open-to-the-community] One week team-effort to reach v2.0 of HF datasets library

thomwolf · November 23, 2020, 4:10pm

Hi all,

We are planning one of the biggest team efforts we have ever undertaken next week (Nov 30th to Dec 4th) to reach v2.0 of the datasets library (Edit: final day extended to next Wednesday Dec 9th!).

The effort will involve more than half of HuggingFace (!), with about 15 people including members who’ve defined the library like @lhoestq, @yjernite, @joeddav, @jplu and @patrickvonplaten, members of the research team like @teven and @VictorSanh, members of the OSS team like @Narsil, newcomers like @abhishek, awesome part-time members like @aymm and @canwenxu, and many others including @madlag and yours truly. (Edit: And now over 200 external participants as well!)

It will be targeted at adding and tagging a large number of NLP datasets in the datasets library, with the goal of reaching 500+ datasets and covering and organizing as much of the NLP dataset ecosystem as possible.

We are taking the opportunity to develop some tools that make it easier to add and tag datasets in the library, as well as to create dataset cards for them.

After internal discussion, we have decided to open this time-limited project to external contributors, in case you want a little taste of what it is like to participate in an internal HuggingFace team effort.

Basically, you can ping me or any one of us and I will add you to the Slack channel and give you access to the tools we use, as well as detailed information on the workflow and a list of datasets that we think are worth adding.

There might be (Edit: “will definitely be”) a small reward in the form of HuggingFace swag, and of course the chance to share your contribution to this project. Keep in mind, though, that this is an open-source effort: join if you want to make an open contribution and enjoy a bit of the HuggingFace vibe. This is not an internship or a job offer (for that, you should check and apply via our profile on AngelList!). We expect most of the work to be done by the full-time members of HuggingFace, but we are always happy to share how we work and to collaborate with external contributors, which is why we are opening this project.

what is it about:

we are adding a lot of new datasets to the library (in particular for many NLP tasks, and we would like to have more datasets in low-resource languages as well) with the aim of covering as much ground as possible

how you can join:

post here to say that you want to participate and I will add you to our Slack => That’s it

what you’ll get

enjoy a bit of HuggingFace vibe by joining the team sprint
receive a special event gift (actually 2 gifts, see this post further down the thread for details!), because it’s so amazing to see the community this involved that we wanted to commemorate the event!

BIG UPDATE
We have just extended the deadline to next Wednesday (Dec 9th), so latecomers can still participate!

SECOND BIG UPDATE
A lot of people are still joining (we are on the way to 300 participants), so we are extending the deadline a bit again, though it will be a limited extension because we have to end the project at some point.

More precisely:
All participants who have opened at least one PR before the end of Wednesday (Dec 9th) can continue adding datasets until the end of Sunday (Dec 13th), and these will be counted in the sprint.

In other words:
If you have opened one PR before Wednesday (and are thus eligible for the special event tee-shirt goody), you will have until the end of Sunday to add two other datasets if you want, and join the main-contributors channel of the Slack (+ get the special event mug).

Open-sourcely yours,

Thom

BramVanroy · November 24, 2020, 12:48pm

Love this! I’ll be too preoccupied the following weeks, but I’ll definitely join in in the future if such an event is done again!

vblagoje · November 25, 2020, 1:21pm

@thomwolf @lhoestq I want to join and work on adding larger datasets used for model pre-training. I’d start with preparing datasets used to train Alexa Bort. We already have Wikipedia, Bookcorpus, OpenWebText. I want to add Wiktionary, UrbanDictionary, One Billion Words, the news subset of Common Crawl (Nagel, 2016).

I’d like to contribute to the creation of datasets tooling so any researchers working on a nextgen LM can quickly and easily make arbitrary “brews” of these large datasets and use them in pre-training.

Cheers,
Vladimir

leozhao · November 25, 2020, 9:43pm

Count me in please.

Zaid · November 25, 2020, 9:44pm

Hey, how do you decide which datasets to add? Any priorities for low-resource languages? I am interested in working with Arabic datasets.

yjernite · November 25, 2020, 9:47pm

@Zaid yes, we are aiming to improve coverage of low-resource languages! Would you mind posting some of the Arabic language datasets you’d want to see added? (And if you can add links to the data location and paper, if available, that would be fantastic!)

rafaelsandroni · November 25, 2020, 9:48pm

Hi, I’m open to contributing. Do you have a backlog?
As for low-resource languages, I’m interested in working with Portuguese datasets.

thomwolf · November 25, 2020, 9:51pm

I’ve added a note to the main post. Basically, if you know of datasets you would like to see added to the library (e.g. in Portuguese), feel free to dump them here, with a link to their location for instance.

anaerobeth · November 25, 2020, 9:53pm

I’m interested in contributing to this effort.

Zaid · November 25, 2020, 10:09pm

This is an initial list for Arabic

Dataset	Paper	Link
SOQAL	Neural Arabic Question Answering	https://github.com/husseinmozannar/SOQAL
HARD	Hotel Arabic-Reviews Dataset Construction for Sentiment Analysis Applications	https://github.com/elnagara/HARD-Arabic-Dataset
ArSenTD-LEV	A Multi-Topic Corpus for Target-based Sentiment Analysis in Arabic Levantine Tweets	http://oma-project.com/ArSenL/ArSenTD_Lev_Intro
ANERcorp	ANERsys: An Arabic Named Entity Recognition System Based on Maximum Entropy	http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp
LABR	A Large-Scale Arabic Book Reviews Dataset	https://github.com/mohamedadaly/LABR
AJGT	N/A	https://github.com/komari6/Arabic-twitter-corpus-AJGT
Multi-datasets	Building Large Arabic Multi-domain Resources for Sentiment Analysis	https://github.com/hadyelsahar/large-arabic-sentiment-analysis-resouces
TEAD	Using Tweets and Emojis to Build TEAD: an Arabic Dataset for Sentiment Analysis	https://github.com/HSMAabdellaoui/TEAD
COVID-19 dataset	Large Arabic Twitter Dataset on COVID-19	https://github.com/SarahAlqurashi/COVID-19-Arabic-Tweets-Dataset

lewtun · November 25, 2020, 10:12pm

Nice initiative!

It would be cool to add (more to come as I think of them):

Dataset	Paper	Link	Comments
PAN-X / Wikiann	Massively Multilingual Transfer for NER	https://github.com/afshinrahimi/mmner	Although a subset of this dataset is available in the XTREME dataset, XTREME doesn’t have all the languages and forces you to do a clunky manual download.
NOAH’s Corpus of Swiss German Dialects	Compilation of a Swiss German Dialect Corpus and its Application to PoS Tagging	https://noe-eva.github.io/NOAH-Corpus/	PoS tagged
The ArchiMob corpus	ArchiMob - A Corpus of Spoken Swiss German	https://drive.switch.ch/index.php/s/vYZv9sNKetuPYTn	PoS tagged, download possible via curl

rafaelsandroni · November 25, 2020, 10:13pm

Adding an initial list for datasets in Portuguese:

Dataset	Paper	Link
b5 corpus	Building a Corpus for Personality-dependent Natural Language Understanding and Generation	https://drive.google.com/file/d/0B-KyU7T8S8bLTHpaMnh2U2NWZzQ/view
BlogSet-BR	BlogSet-BR: A Brazilian Portuguese Blog Corpus	https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/blogset-br/
MilkQA Dataset	MilkQA: a Dataset of Consumer Questions for the Task of Answer Selection	nilc.icmc.usp.br/nilc/index.php/milkqa/

rivenhart · November 25, 2020, 11:20pm

@thomwolf that’s a cool initiative! Would love to be a part of it. Can probably help you with Russian.

Wambui · November 26, 2020, 1:21am

This is a nice initiative. I would like to contribute to Swahili - http://opus.nlpl.eu/download.php?f=GoURMET/v1/xml/sw.zip

Vaibhavbrkn · November 26, 2020, 2:35am

@thomwolf I am very excited to participate in this project. And I am thrilled to join and contribute.

mayur627 · November 26, 2020, 3:06am

I would love to contribute to development.

Robonidos · November 26, 2020, 3:21am

I would like to contribute!

AlenaH · November 26, 2020, 4:02am

Would be fantastic to contribute, +1 to be added

imflash217 · November 26, 2020, 5:37am

@thomwolf, amazing. Please count me in too. I would love to do my part for a Sanskrit dataset.
Dataset: https://zenodo.org/record/803508#
Paper: https://www.aclweb.org/anthology/W17-2214.pdf

jeromeku · November 26, 2020, 5:58am

Eager to help out!
