I want to distill the BERT base Chinese model from Google, but I can't find the dataset that was used to pre-train it. Chinese corpora are fairly rare on the internet, so any information is appreciated.
True. I followed the link and the Chinese dataset doesn’t come up…
https://yknzhu.wixsite.com/mbweb
There seem to be a lot of Chinese speakers on HF, so I'm sure you can find someone who might know. You can ask them directly by opening a Discussion, or mention them and they'll notice.
I’ve sorted the Chinese datasets by popularity, but I wonder if any of them are above this one.
By the way, as for how to find such people: for example, if you look through the users who liked the following dataset (you can actually do that), there's a pretty good chance they are Chinese speakers. Well, not me.