Open Source survey results [Jan 2022]
Back in December, we shared the first feedback request for Hugging Face’s Open-Source Ecosystem. While the first two requests for feedback were centered on transformers, this one aimed to gather feedback on all components making up the Hugging Face ecosystem.
Once again, it is thanks to all of your answers that we’re able to steer the libraries and the ecosystem in a direction that fits you all. We want to thank you for sharing your thoughts and a little of your time to let us know what you think, what you like and what you dislike. As a community-focused endeavor, it is amazing to see that you all care about the direction in which we’re heading, to see the outpouring of positive and encouraging comments, and the very constructive feedback you are all ready to give. From all of us at Hugging Face: Thank you!
With this feedback summary, we aim for something a bit different from the previous editions: we have crafted roadmaps for three aspects of the feedback request on which you had a lot of comments.
These are the TensorFlow support in the transformers library, the direction of the datasets library, and the current state of the transformers documentation.
TensorFlow roadmap
We were delighted to see that the appreciation for the TensorFlow side of the transformers library has greatly improved between this survey and the previous one. The net promoter score nearly doubled, from 14 to 27. We aim to continue in this direction, and the following roadmap was tailored according to your comments and expectations:
A recurring piece of feedback is that while the TensorFlow side of the library has improved over the past months, it is still not on the same level as PyTorch. Two aspects are at play here:
- A non-negligible number of model architectures available in PyTorch are not available in TensorFlow
- Some TensorFlow functionality does not work as well as its PyTorch equivalent
In order to work through this, we’ve identified a few key models and utilities on which we expect progress in the coming months.
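For context on the scores mentioned above, a net promoter score is conventionally computed from 0-10 ratings as the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 to 6). The sketch below assumes that standard convention; this post does not state exactly how the survey computed its score, and the sample ratings are made up for illustration.

```python
def net_promoter_score(ratings):
    """Return the NPS for a list of 0-10 ratings, rounded to an integer.

    Assumes the conventional definition: promoters rate 9-10,
    detractors rate 0-6, and passives (7-8) only count in the total.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return round(100 * (promoters - detractors) / len(ratings))


# Hypothetical sample: 5 promoters, 3 passives, 2 detractors.
sample = [10, 9, 9, 10, 9, 8, 7, 8, 5, 3]
print(net_promoter_score(sample))
```

Because the score is a difference of percentages, it ranges from -100 to 100, so a move from 14 to 27 means the margin of promoters over detractors grew by 13 percentage points.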
TensorFlow is lacking model architectures
See below for the architectures that are yet to be implemented in TensorFlow and that were requested in the survey or identified by the team as important points of focus.
- LayoutLM-v2
- DeBERTa-v3 (now supported!)
- GPT-J
- DETR, alongside backbones
- Wav2Vec2 and Hubert, while available in TensorFlow, are lacking a speech_recognition_ctc.py example script
- WavLM
Utilities
The TensorFlow experience is sometimes sub-par compared to PyTorch, as some features are less well supported, missing, or unoptimized. In order to improve this for users, here is what we’ll be focusing on in the coming months:
- The generate method is not as customizable as its PyTorch/Flax counterpart, and it is slower.
  - Rework the TensorFlow generate method so that it is readable, understandable, and customizable. This has been addressed by transformers#15562
  - Make generate XLA-compatible. This is being addressed by transformers#15786 and transformers#15793
- Transformers’ tokenizers do not accept TensorFlow string tensors as input, which adds an additional necessary transformation.
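To illustrate the extra transformation mentioned in the last point: a TensorFlow string tensor holds raw bytes, so its contents typically need to be decoded into Python strings before being handed to a tokenizer. The following is a minimal sketch of that step using only the standard library; the helper name `to_python_strings` is hypothetical, and the input list stands in for what something like `tensor.numpy().tolist()` would return.

```python
def to_python_strings(batch, encoding="utf-8"):
    """Convert a batch of bytes (or str) items into a list of str.

    Hypothetical helper: tokenizers expect Python strings, while
    TensorFlow string tensors yield bytes objects once materialized.
    """
    return [
        item.decode(encoding) if isinstance(item, bytes) else str(item)
        for item in batch
    ]


# Stand-in for the bytes a string tensor would yield
# (assumption: obtained with something like `tensor.numpy().tolist()`).
raw_batch = [b"Hello world", "d\u00e9j\u00e0 vu".encode("utf-8")]

texts = to_python_strings(raw_batch)
print(texts)
# `texts` is now a plain list of str that can be passed to a tokenizer.
```

This conversion happens in Python, outside the TensorFlow graph, which is why it counts as an additional step in a TensorFlow data pipeline.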
Datasets roadmap
For the first edition of the survey containing feedback requests relative to datasets, we were excited to see the majority of respondents leverage datasets in their daily workflows, even independently of transformers.
See below for a detailed roadmap of the next steps regarding datasets.
Documentation
We understand that a lot of issues mentioned were linked to lacking, hard to find, or incomplete documentation. We’re in the process of switching from Sphinx to our own frontend and customizing it so that search is better integrated. Alongside it, we aim to:
- Add an examples folder in the datasets repository with examples of loading/processing/preparing datasets in multiple modalities.
- Better document how to manage large datasets for optimal performance.
- Provide docs and examples specifically tailored to vision, speech and time-series datasets.
- Improve dataset search on the hub with better tags/tasks.
Load and share your own data
Dataset scripts remain one of the pain points when contributing datasets to the hub. In order to help users contribute their datasets more simply, we’ll work on relaxing the constraints on dataset scripts: having generic/standard loaders and sharing structure among datasets.
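As a rough illustration of what a generic loader could mean, the sketch below picks a reader based on the file extension, so that a contributor only supplies data files rather than a custom script. This is not the datasets library's implementation or API; the names `LOADERS` and `load_examples` and the supported formats are assumptions made for this example.

```python
import csv
import json
import tempfile
from pathlib import Path


def _load_csv(path):
    # Each row becomes a dict keyed by the header line.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def _load_jsonl(path):
    # One JSON object per non-empty line.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical registry mapping file extensions to generic readers.
LOADERS = {".csv": _load_csv, ".jsonl": _load_jsonl}


def load_examples(path):
    """Load examples from a data file without a dataset-specific script."""
    suffix = Path(path).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"No generic loader for {suffix!r} files")
    return LOADERS[suffix](path)


# Usage with two small temporary files.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    csv_path.write_text("text,label\nhello,1\n", encoding="utf-8")
    jsonl_path = Path(tmp) / "data.jsonl"
    jsonl_path.write_text('{"text": "hi", "label": 0}\n', encoding="utf-8")

    csv_rows = load_examples(csv_path)
    jsonl_rows = load_examples(jsonl_path)
```

The point of the design is that supporting a new format means registering one more reader, instead of every dataset shipping its own loading code.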
Streaming datasets
Streaming datasets are currently lacking features that traditional Dataset objects have. We aim to reduce this disparity by:
- Aligning the map for streaming and non-streaming datasets
- Adding additional features for streaming datasets:
  - filter
  - Column manipulation
  - cast
  - others
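To make the difference with in-memory datasets concrete, here is a minimal sketch of lazy map and filter operations built on Python generators: examples are transformed one at a time as they are requested, rather than all at once. It only illustrates the general idea of streaming; the function names and the example source are hypothetical and do not reflect how the datasets library implements these features.

```python
from itertools import islice


def stream_examples(n):
    """Hypothetical source that yields examples one by one."""
    for i in range(n):
        yield {"id": i, "text": f"example {i}"}


def lazy_map(fn, examples):
    # Apply fn to each example only when it is requested.
    for example in examples:
        yield fn(example)


def lazy_filter(predicate, examples):
    # Skip examples that do not satisfy the predicate.
    for example in examples:
        if predicate(example):
            yield example


# Keep even ids, then add a "length" column; nothing runs yet.
pipeline = lazy_map(
    lambda ex: {**ex, "length": len(ex["text"])},
    lazy_filter(lambda ex: ex["id"] % 2 == 0, stream_examples(1_000_000)),
)

# Only the examples needed to produce three results are ever generated.
first_three = list(islice(pipeline, 3))
print(first_three)
```

Operations such as shuffling or random access are harder to offer in this setting because the full dataset is never materialized, which is one reason the streaming and non-streaming feature sets can diverge.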
Metrics shouldn’t be in datasets
The metrics were mentioned as a very positive aspect of datasets, but several respondents raised the question of whether they should stay in datasets or be split out.
We’re happy to mention that we’ve been thinking about splitting the metrics component of datasets into a separate, standalone project.
Transformers Documentation roadmap
We’ve been looking forward to migrating the current documentation to Diátaxis, a new information architecture (IA). Information architecture refers to how content is organized, which impacts how you interact with the documentation. Diátaxis is a user-first framework: it focuses on how you interact with the documentation instead of arranging it around a product’s features. The feedback shows that some users have difficulty finding what they’re looking for. The new documentation structure will help address this issue, in addition to making Transformers more accessible to users from a broader range of backgrounds.
Changes to the documentation structure are coming soon to a browser near you. It will be easier for you to find exactly what you’re looking for, when you need it. The documentation will be separated into four categories:
- Tutorials will teach you the basic skills you need to use Transformers. If you are new to the library, we recommend starting here!
- How-to guides will show you how to apply your skills to solve specific problems.
- Concept guides will explain things and help you understand a topic better. This way you can focus on doing - in the tutorial and how-to guides - without getting too distracted by explanations and abstractions.
- Reference will help you as you work, letting you look up what you need to know and describing how things work. We will continue our work on ensuring the API documentation is accurate, up-to-date, and includes examples for training and inference.
Thank you once again for helping us steer the ecosystem in a direction that fits your needs. As always, we welcome any and all contributions to our libraries, to the datasets hub, and to the model hub, and we’ll be very happy to guide you if you would like to help on any of the bullet points of the roadmaps defined above, as well as any other (code or other!) contribution.
Please comment on this post if you’d like to help!