I would like to propose a solution that could help Hugging Face support model development in many languages. Currently, we face a significant challenge: while a vast amount of data is available in English, data in other languages is scarce. This limits our ability to create models in different languages and leaves many users underserved.
To overcome this difficulty, I propose a service integrated into the Hugging Face Datasets library that would translate dataset contents directly. The idea is to use an efficient machine translation model to perform this task. This would encourage and facilitate the development of models in various languages.
By incorporating this functionality, we would broaden the reach and usefulness of Hugging Face for speakers of other languages. Imagine being able to train and deploy high-quality models in French, Spanish, Portuguese, Mandarin, and many other languages, opening up new opportunities and promoting greater global inclusion.
I am excited about this solution and would like to discuss implementation details and the necessary tools. I intend to use a robust translation model from the Hugging Face Transformers library, which offers models that perform well on translation tasks. Additionally, we can explore complementary techniques such as data preprocessing and automatic post-editing to further improve the results.
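To make the idea concrete, here is a minimal sketch of how a text column could be translated batch by batch. Everything named here is my own assumption for illustration: the helper names `make_batch_translator` and `load_pipeline_translator`, the `text` column, the `_translated` suffix, and the `Helsinki-NLP/opus-mt-en-fr` checkpoint. It also assumes a Transformers version that provides the `translation` pipeline task. It is not an existing Datasets feature.

```python
from typing import Callable, Dict, List

# A translator takes a list of source strings and returns one translation per input.
TranslateFn = Callable[[List[str]], List[str]]


def make_batch_translator(
    translate: TranslateFn,
    column: str = "text",          # hypothetical column name
    suffix: str = "_translated",   # hypothetical suffix for the new column
) -> Callable[[Dict[str, List[str]]], Dict[str, List[str]]]:
    """Build a function that can be passed to Dataset.map(..., batched=True)."""

    def _apply(batch: Dict[str, List[str]]) -> Dict[str, List[str]]:
        texts = batch[column]
        translated = translate(texts)
        # Guard against translators that drop or merge rows.
        if len(translated) != len(texts):
            raise ValueError("translator must return one output per input")
        # Returning a new key adds a column and keeps the original text.
        return {column + suffix: translated}

    return _apply


def load_pipeline_translator(
    model_name: str = "Helsinki-NLP/opus-mt-en-fr",  # assumed example checkpoint
) -> TranslateFn:
    """Wrap a Transformers translation pipeline as a TranslateFn."""
    # Imported lazily so the rest of the sketch runs without transformers installed.
    from transformers import pipeline

    pipe = pipeline("translation", model=model_name)

    def _translate(texts: List[str]) -> List[str]:
        return [out["translation_text"] for out in pipe(texts)]

    return _translate


# Example usage (downloads a model, so it is left commented out):
# from datasets import load_dataset
# ds = load_dataset("imdb", split="train[:100]")
# ds = ds.map(make_batch_translator(load_pipeline_translator()),
#             batched=True, batch_size=16)
```

Keeping the translator as a plain callable means the model can be swapped out, or wrapped with preprocessing and post-editing steps, without touching the mapping logic.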
I am available to exchange ideas and collaborate with anyone interested in this initiative. Together, we can make Hugging Face a truly global platform, so that users everywhere can benefit from language models.
It makes more sense to implement something like this as a Space on the Hub rather than as a datasets feature.