Back in December, we shared the first feedback request for Hugging Face’s Open-Source Ecosystem. While the first two requests for feedback were centered on `transformers`, this one aimed to gather feedback on all components making up the Hugging Face ecosystem.
Once again, it is thanks to all of your answers that we’re able to steer the libraries and the ecosystem in a direction that fits you all. We wanted to thank you for sharing your thoughts and a little of your time to let us know what you think, what you like, and what you dislike. As a community-focused endeavor, it is amazing to see that you all care about the direction in which we’re heading, to see the outpouring of positive and encouraging comments, and the very constructive feedback you are all ready to give. From all of us at Hugging Face: thank you!
With this feedback summary, we aim for something a bit different from the previous editions: we have crafted roadmaps for three aspects of the feedback request on which you had a lot of comments.
These are the TensorFlow support in the `transformers` library, the direction of the `datasets` library, and the current state of the documentation of `transformers`.
We were delighted to see that the appreciation for the TensorFlow side of the `transformers` library has greatly improved between this survey and the previous one. The net promoter score nearly doubled, from 14 to 27. We aim to continue in this direction, and the following roadmap was tailored according to your comments and expectations:
Recurring feedback is that while the TensorFlow side of the library has improved over the past months, it is still not on the same level as PyTorch. Two aspects are at play here:
- A non-negligible number of model architectures available in PyTorch are not yet available in TensorFlow
- Some TensorFlow functionality does not work as well as its PyTorch equivalent
In order to work through this, we’ve identified a few key models and utilities on which we expect progress in the coming months.
See below for the architectures that are yet to be implemented in TensorFlow and that were requested in the survey or identified by the team as important points of focus.
- DeBERTa-v3 (now supported!)
- DETR, alongside backbones
- Wav2Vec2 and HuBERT, while available in TensorFlow, are lacking a
TensorFlow usage sometimes lags behind PyTorch, as some features are less supported, missing, or unoptimized. In order to improve this for users, here is what we’ll be focusing on in the coming months:
- The `generate` method is not as customizable as its PyTorch/Flax counterpart, and it is slower.
- Transformers’ tokenizers do not accept TensorFlow string tensors as input, which requires an extra conversion step.
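To illustrate that extra step: `tf.string` tensors hold raw bytes, so their contents must be decoded into Python strings before they can be passed to a tokenizer. Here is a minimal, dependency-free sketch of the conversion; the bytes shown are what calling `.numpy()` on a `tf.string` tensor would return, and the commented tokenizer call is illustrative:

```python
# What a tf.string tensor's .numpy() yields: a sequence of bytes objects.
raw_bytes = [b"Hello world", b"Another sentence"]

# The extra conversion step: decode the bytes into Python strings,
# since Transformers tokenizers do not accept tf.string tensors directly.
texts = [b.decode("utf-8") for b in raw_bytes]

print(texts)  # ['Hello world', 'Another sentence']
# The strings can then be tokenized as usual, e.g.:
# batch = tokenizer(texts, padding=True, return_tensors="tf")
```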
For the first edition of the survey containing feedback requests relative to `datasets`, we were excited to see that the majority of respondents leverage `datasets` in their daily workflows, even independently of `transformers`.
See below for a detailed roadmap of the next steps regarding datasets.
We understand that a lot of the issues mentioned were linked to lacking, hard-to-find, or incomplete documentation. We’re in the process of switching from Sphinx to our own frontend and customizing it so that search is better integrated. Alongside this, we aim to:
- Add an `examples` folder in the `datasets` repository with examples of loading/processing/preparing datasets in multiple modalities.
- Document better how to manage large datasets for optimal performance
- Docs and examples specifically tailored around vision, speech and time-series datasets
- Improve dataset search on the Hub with better tags and tasks.
Dataset scripts remain one of the main pain points when contributing datasets to the Hub. In order to help users contribute their datasets more simply, we’ll work on relaxing the constraints on dataset scripts: providing generic/standard loaders and sharing structure among datasets.
Streaming datasets currently lack features that traditional `Dataset` objects have. We aim to reduce this disparity by:
- Aligning the `map` method for streaming and non-streaming datasets
- Adding additional features for streaming datasets:
  - Column manipulation
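The disparity comes from the fact that a streaming dataset applies `map` lazily, one example at a time as data is yielded, rather than eagerly over an on-disk table. A stdlib-only sketch of the idea (not the actual `datasets` implementation):

```python
def lazy_map(examples, fn):
    # Yield transformed examples one at a time, as a streaming dataset would,
    # instead of materializing the whole mapped dataset up front.
    for example in examples:
        yield fn(example)

# Stand-in for a streamed split: a generator of example dicts.
stream = ({"text": t} for t in ["a", "b", "c"])
mapped = lazy_map(stream, lambda ex: {"text": ex["text"].upper()})

first = next(mapped)  # {'text': 'A'} — nothing else has been processed yet
```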
The `metrics` component was mentioned as a very positive aspect of `datasets`, but several respondents raised the question of whether metrics should remain in `datasets` or be split out.
We’re happy to mention that we’ve been thinking about splitting the `metrics` component of `datasets` into a separate, standalone project.
We’ve been looking forward to migrating the current documentation to Diátaxis, a new information architecture (IA). Information architecture refers to how content is organized, which impacts how you interact with the documentation. Diátaxis is a user-first framework: it focuses on how you interact with the documentation instead of arranging it based on a product’s features. Based on the feedback, some users have difficulty finding what they’re looking for. The new documentation structure will help address this issue, in addition to making Transformers more accessible to users from a broader range of backgrounds.
Coming soon to a browser near you: changes to the documentation structure that will make it easier for you to find exactly what you’re looking for, when you need it. The documentation will be separated into four categories:
- Tutorials will teach you the basic skills you need to use Transformers. If you are new to the library, we recommend starting here!
- How-to guides will show you how to apply your skills to solve specific problems.
- Concept guides will explain things and help you understand a topic better. This way you can focus on doing (in the tutorials and how-to guides) without getting too distracted by explanations and abstractions.
- Reference will help you as you work, allowing you to look up what you need to know and how things work. We will continue our work on ensuring the API documentation is accurate, up-to-date, and includes examples for training and inference.
Thank you once again for helping us steer the ecosystem in a direction that fits your needs. As always, we welcome any and all contributions to our libraries, to the datasets hub, and to the model hub, and we’ll be very happy to guide you if you would like to help with any of the bullet points in the roadmaps above, or with any other contribution (code or otherwise!).
Please comment on this post if you’d like to help!