Wikilangs - Open NLP for 340+ Wikipedia Languages 🌐

Hi everyone! I’m Omar, an ML/NLP researcher from Berlin, and I’ve been building Wikilangs: open-source NLP infrastructure and models trained on Wikipedia across 340+ languages, including many that have little to no existing model coverage. :globe_showing_americas::globe_showing_europe_africa::globe_showing_asia_australia:

The project currently provides, for each supported language:

  • Vocabulary, a word list with usage frequencies

  • Custom tokenizers, trained on native Wikipedia text per language

  • N-gram language models, lightweight, fast, usable offline

  • Word embeddings, cross-lingual vector spaces, monolingual and English-aligned

  • Markov chains, to generate all the nonsensical text you ever dreamed of

  • Morphological tokenizers, to make stemming easier (uses an experimental statistical approach)

  • And a Wordle-like game to kill time while your models are training :slight_smile:
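
For anyone who hasn't played with Markov chains before, here is a minimal pure-Python sketch of the general idea behind the text generation item above. To be clear, this is not the Wikilangs implementation or its API: `build_chain`, `generate`, and the toy corpus are hypothetical names I made up for illustration, and it assumes simple whitespace tokenization.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` words to the words observed right after it."""
    words = text.split()  # assumption: plain whitespace tokenization
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)


def generate(chain, length=20, seed=None):
    """Walk the chain from a random state, emitting at most `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))  # sorted so a fixed seed is reproducible
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:  # dead end: this state never had a follower
            break
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out)


# Toy corpus, purely illustrative
corpus = "the cat sat on the mat and the cat ate the fish on the mat"
chain = build_chain(corpus, order=1)
print(generate(chain, length=10, seed=0))
```

A higher `order` gives more coherent but more repetitive output. On a real Wikipedia dump you would want a proper tokenizer for the language instead of `split()`.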

You can explore everything at wikilangs.org or install directly:

pip install wikilangs

The hub page is at :hugs: huggingface.co/wikilangs. Each language has its own model card with download stats, training corpus size, and evaluation notes.

This project builds on my dataset :hugs: omarkamali/wikipedia-monthly, which publishes a monthly text corpus for every language on Wikipedia (3 years ahead of the official Wikipedia dataset on HF).


Why I’m posting here: I’d love to connect with researchers, engineers, and community members working on NLP for any of these languages, especially low-resource ones. If you’re working on African languages, Arabic, indigenous languages, creoles and pidgins, or any other underrepresented language, I’d genuinely love to hear what’s missing, what’s broken, and what would make these resources actually useful for your work.

A few things I’m actively looking for:

  • Native speakers / language experts who can help validate tokenization quality

  • Collaborators interested in building downstream projects (LLMs, search and semantics, morphological tokenization, user-facing apps and games …)

  • Feedback on any specific language’s model quality, or ideas on how to take the project further

Drop a reply and introduce yourself: which language(s) are you working on? :backhand_index_pointing_down:
