[Help wanted] Common Crawl needs help to be richer & more multilingual

The Common Crawl team is looking to make Common Crawl (CC) culturally richer and more multilingual, and they have asked us to help improve their crawl.

Since Common Crawl is used to train most LLMs out there, this would enrich the cultural and linguistic knowledge of future LLMs across many topics and languages!

They asked us to help curate a small list of websites for every language (and also every culture and community) in the world. The results will feed back into the main crawl and the new crawl.

If you can take a few minutes to think about quality websites in your language, please share them with the Common Crawl team.

Moreover, every contributed website actually contributes much more content than you think! Every URL serves as a seed for the crawls, meaning that Common Crawl will crawl not only the website’s content but also the content of every website connected to it.
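To picture why a single seed goes a long way, here is a minimal Python sketch of seed expansion over a toy link graph. This is not Common Crawl's actual crawler: the `LINKS` graph, the `.example` site names, the `reachable_from_seeds` function, and the `max_hops` limit are all made up for illustration, and a real crawl fetches pages and applies its own scheduling and limits.

```python
from collections import deque

# Hypothetical toy link graph: each site maps to the sites it links to.
LINKS = {
    "seed.example": ["blog.example", "news.example"],
    "blog.example": ["forum.example"],
    "news.example": ["seed.example", "forum.example"],
    "forum.example": [],
    "unlinked.example": [],
}


def reachable_from_seeds(seeds, links, max_hops=2):
    """Breadth-first expansion from seed sites, at most max_hops links away."""
    seen = set(seeds)
    queue = deque((site, 0) for site in seeds)
    while queue:
        site, hops = queue.popleft()
        if hops == max_hops:
            # Hop budget used up: do not follow this site's links.
            continue
        for target in links.get(site, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return seen


if __name__ == "__main__":
    for site in sorted(reachable_from_seeds(["seed.example"], LINKS)):
        print(site)
```

In this toy graph, one contributed seed leads to three more sites, while a site that nothing links to (`unlinked.example`) is never found unless someone submits it directly. That is the case where a contributed URL matters most.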

If you have questions for the Common Crawl team about this Web Languages project (or their other annotation project), the main channel for this is on Discord: Common Crawl’s Web Languages and LangID projects
