Please read the topic category description to understand what this is all about
Description
For many online applications it is not known in advance what language an end-user will communicate in. The goal of this project is to build a system that can automatically predict the language a text is written in.
Model(s)
There are a few popular multilingual models that you can start with:
There are quite a few multilingual datasets available on the Hub. Many of these have a “language” field that could be used as a target for the model to predict.
Challenges
This project will likely require you to combine several datasets together to gain enough coverage of many languages.
Desired project outcomes
Create a Streamlit or Gradio app on Spaces that can predict the language of a piece of text provided by an end-user
Don’t forget to push all your models and datasets to the Hub so others can build on them!
Additional resources
A good baseline to compare your model against is the Python langidlibrary
Discord channel
To chat and organise with other people interested in this project, head over to our Discord and:
Follow the instructions on the #join-course channel
Join the #language-detection channel
Just make sure you comment here to indicate that you’ll be contributing to this project
@lewtun Hi, I’m interested in working on this project.
Are we supposed to consider all the languages recognized by the langid library (which we’ll be using as baseline), or is it ok to consider fewer languages ?
Hi @hfawaz, this community event ended last November. Having said that, you’re more than welcome to use the #course:course-event topics as inspiration to build NLP powered applications