Build a language detector

lewtun · November 9, 2021, 7:11pm

Please read the topic category description to understand what this is all about

Description

For many online applications it is not known in advance what language an end-user will communicate in. The goal of this project is to build a system that can automatically predict the language a text is written in.

Model(s)

There are a few popular multilingual models that you can start with:

Datasets

There are quite a few multilingual datasets available on the Hub. Many of these have a “language” field that could be used as a target for the model to predict.

Challenges

This project will likely require you to combine several datasets together to gain enough coverage of many languages.
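
As a rough illustration of what combining datasets can involve, the sketch below maps two sources onto a shared schema and checks per-language coverage. The toy records, column names, and the `normalize` and `combine` helpers are all hypothetical; with 🤗 Datasets you would more likely rename columns and then call `concatenate_datasets`.

```python
from collections import Counter


def normalize(records, text_key, lang_key, lang_map=None):
    """Map one source's records onto a shared {"text", "language"} schema."""
    lang_map = lang_map or {}
    out = []
    for rec in records:
        text = (rec.get(text_key) or "").strip()
        lang = rec.get(lang_key)
        if not text or lang is None:
            continue  # skip rows with no usable text or label
        lang = str(lang).strip().lower()
        # Translate source-specific codes (e.g. "spa") to the shared ones.
        out.append({"text": text, "language": lang_map.get(lang, lang)})
    return out


def combine(*sources):
    """Concatenate normalized sources, dropping exact duplicate rows."""
    seen, merged = set(), []
    for source in sources:
        for row in source:
            key = (row["text"], row["language"])
            if key not in seen:
                seen.add(key)
                merged.append(row)
    return merged


# Hypothetical toy sources with different column names and label conventions.
source_a = [
    {"sentence": "Bonjour tout le monde", "language": "fr"},
    {"sentence": "Hello world", "language": "en"},
]
source_b = [
    {"text": "Hola mundo", "lang": "spa"},
    {"text": "Hello world", "lang": "eng"},
]

merged = combine(
    normalize(source_a, "sentence", "language"),
    normalize(source_b, "text", "lang", lang_map={"spa": "es", "eng": "en"}),
)

# Per-language counts show where coverage is still thin.
coverage = Counter(row["language"] for row in merged)
print(coverage)
```

Checking the counts per language before training makes it easier to spot languages that need another data source.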

Desired project outcomes

Create a Streamlit or Gradio app on Spaces that can predict the language of a piece of text provided by an end-user
Don’t forget to push all your models and datasets to the Hub so others can build on them!
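
A minimal sketch of what the Gradio side of such an app might look like. The `predict_language` function here is a hypothetical stopword-counting stand-in, not a trained model; in a real project it would wrap your fine-tuned classifier. The `build_demo` helper is also an assumption and imports Gradio lazily so the scoring logic can be tried without it installed.

```python
# Hypothetical toy vocabulary; a real app would call a trained model instead.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "fr": {"le", "et", "est", "les", "une"},
    "es": {"el", "y", "los", "una", "por"},
}


def predict_language(text):
    """Return a {language: score} dict whose scores sum to 1."""
    tokens = text.lower().split()
    hits = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        # No evidence either way: fall back to a uniform distribution.
        return {lang: 1 / len(hits) for lang in hits}
    return {lang: count / total for lang, count in hits.items()}


def build_demo():
    """Wrap the predictor in a Gradio interface (requires `pip install gradio`)."""
    import gradio as gr

    return gr.Interface(
        fn=predict_language,
        inputs=gr.Textbox(lines=3, label="Text"),
        # gr.Label accepts a dict of label -> confidence.
        outputs=gr.Label(num_top_classes=3),
        title="Language detector",
    )


# In a Space, app.py would typically end with:
# build_demo().launch()
```

Returning a dictionary of scores rather than a single label lets the interface show the top few candidate languages with their confidences.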

Additional resources

A good baseline to compare your model against is the Python langid library
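
One way to run that comparison is sketched below, assuming a small held-out list of (text, language) pairs. The `accuracy` helper, the example sentences, and `always_english` are hypothetical; `langid.classify` is the library's call and returns a (language code, score) pair, of which only the code is used here.

```python
def accuracy(predict, examples):
    """Fraction of (text, language) pairs where predict(text) matches the label."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(predict(text) == lang for text, lang in examples)
    return correct / len(examples)


def langid_predict(text):
    """Baseline adapter around langid (requires `pip install langid`)."""
    import langid  # imported lazily so the rest of the sketch runs without it

    return langid.classify(text)[0]


def always_english(text):
    """Trivial stand-in predictor, useful as a sanity-check floor."""
    return "en"


# Hypothetical held-out examples; replace with your own test split.
examples = [
    ("Hello, how are you?", "en"),
    ("Bonjour, comment ça va ?", "fr"),
    ("Good morning", "en"),
    ("Hola, ¿qué tal?", "es"),
]

print("always-en:", accuracy(always_english, examples))
# With langid installed, the baseline score would come from:
# print("langid:", accuracy(langid_predict, examples))
```

Any model can be plugged in by passing a function that takes a string and returns a language code, which keeps the comparison with the baseline like for like.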

Discord channel

To chat and organise with other people interested in this project, head over to our Discord and:

Follow the instructions on the #join-course channel
Join the #language-detection channel

Just make sure you comment here to indicate that you’ll be contributing to this project

ivanlau · November 16, 2021, 5:27am

@lewtun
Hi, I would like to work on this project.

lewtun · November 16, 2021, 9:38am

Cool to hear @ivanlau ! I’ve created a Discord channel (see topic description) in case you and others want to use it

syri · November 17, 2021, 10:00pm

@lewtun Hi, I’m interested in working on this project.
Are we supposed to consider all the languages recognized by the langid library (which we’ll be using as a baseline), or is it OK to consider fewer languages?

lewtun · November 17, 2021, 10:02pm

It’s totally fine for you to choose the scope for the project. I agree that working with a few languages is a great way to start!

imimen · November 17, 2021, 10:09pm

Hi @lewtun ! I want to work on this project.

syri · November 17, 2021, 10:09pm

Alright, thanks! I’ll do so.

papluca · November 17, 2021, 10:19pm

Hi, I’d be glad to join this project. Am I still in time?

lewtun · November 17, 2021, 10:21pm

I think you’re the 4th person, so yes there’s still time and space on the team!

hfawaz · January 26, 2022, 10:28am

Hi, is this project still on?

lewtun · January 26, 2022, 10:30am

Hi @hfawaz, this community event ended last November. Having said that, you’re more than welcome to use the #course:course-event topics as inspiration to build NLP-powered applications.

hfawaz · January 26, 2022, 10:30am

Thanks, do you have any shared open resources that you found while following this course?

lewtun · January 26, 2022, 10:44am

You might find the official tutorials in the transformers library to be helpful: 🤗 Transformers Notebooks
