Hi all,
We are planning one of the biggest team efforts we have ever done next week (Nov 30th to Dec 4th) to reach v2.0 of the datasets library (Edit: final day extended to next Wednesday, Dec 9th!).
The effort will involve more than half of HuggingFace (!) with about 15 people, including members who’ve defined the library like @lhoestq, @yjernite, @joeddav, @jplu and @patrickvonplaten, members of the research team like @teven and @VictorSanh, members of the OSS team like @Narsil, newcomers like @abhishek, awesome part-time members like @aymm and @canwenxu, and many others including @madlag and yours truly. (Edit: And now over 200 external participants as well!)
It will be targeted toward adding and tagging a large number of NLP datasets in the datasets library, with the goal of reaching 500+ datasets and covering and organizing as much of the NLP dataset ecosystem as we can.
We are taking the opportunity to develop some tools to more easily add and tag datasets in the library, as well as create dataset cards for them.
After internal discussion, we have decided to open this time-limited project to external contributors if you want to have a little taste of what it is to participate in an internal HuggingFace team effort.
Basically, you can ping me or any one of us and we will add you to the Slack channel and give you access to the tools we use, as well as detailed information on the workflow and a list of datasets that we think are worth adding.
There might be (Edit: “will definitely be”) a small reward in the form of HuggingFace swag, and of course the chance to share your contribution to this project. Keep in mind that this is an open-source effort, so join if you want to make an open contribution and enjoy a bit of HuggingFace vibe. This is not an internship or a job offer (for that, you should check and apply on our profile on AngelList!). We expect most of the work to be done by the full-time members of HuggingFace, but we are always happy to share how we work and to collaborate with external contributors, which is why we are opening this project.
what is it about:
- we are adding a lot of new datasets to the library (in particular across many NLP tasks, and we would like to have more datasets in low-resource languages as well) with the aim of covering as much ground as possible
how you can join:
- post here to say that you want to participate and I will add you to our Slack => That’s it
what you’ll get:
- enjoy a bit of HuggingFace vibe by joining the team sprint
- receive a special event gift (actually 2 gifts, see this post further down the thread for details!) because it’s really amazing to see the community so involved here and we want to commemorate this event!
BIG UPDATE
We have just moved the deadline to next Wednesday (Dec 9th), so latecomers can still participate!
SECOND BIG UPDATE
A lot of people are still joining (we are on the way to 300 participants), so we are extending the deadline a bit again, though it will be a limited extension because we have to end the project at some point.
More precisely:
All participants who have opened at least 1 PR before the end of Wednesday (Dec 9th) can continue adding datasets until the end of Sunday (Dec 13th), and those additional datasets will be counted in the sprint.
In other words:
If you have opened 1 PR before the end of Wednesday (and thus are eligible for the special event tee-shirt goody), you will have until the end of Sunday to add 2 other datasets if you want, and join the main-contributors channel of the Slack (+ get the special event mug).
Open-sourcely yours,
Thom