Corpus for pre-training bert-base-chinese

rocinant2023 · September 14, 2024, 2:16am

I want to distill Google's bert-base-chinese model, but I can't find the dataset that was used to pre-train it. Chinese corpora are fairly rare on the internet, so any information is appreciated.

John6666 · September 14, 2024, 1:33pm

True. I followed the link and the Chinese dataset doesn’t come up…

https://yknzhu.wixsite.com/mbweb

I see a lot of Chinese speakers on HF (I think), so I'm sure you can find someone who might know. You could ask them directly by opening a Discussion, or mention them and they'll notice.
I’ve sorted the Chinese datasets by popularity, but I wonder if any of them are above this one.
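
In case anyone wants to reproduce that popularity sort outside the website, here is a rough sketch using only the Python standard library. It assumes the Hub's public REST endpoint `/api/datasets` accepts the `filter`, `sort`, `direction` and `limit` query parameters and returns JSON objects with `id` and `likes` fields. The helper names `build_query_url` and `top_datasets` are made up for illustration, so treat this as a starting point rather than the official way to do it.

```python
import json
import urllib.parse
import urllib.request

# Assumed public endpoint of the Hugging Face Hub for listing datasets.
API = "https://huggingface.co/api/datasets"


def build_query_url(language="zh", limit=10):
    """Build a query URL for datasets tagged with a language, most liked first."""
    params = {
        "filter": f"language:{language}",  # language tag, e.g. language:zh
        "sort": "likes",                   # sort key (assumed to be supported)
        "direction": "-1",                 # descending order
        "limit": str(limit),
    }
    return API + "?" + urllib.parse.urlencode(params)


def top_datasets(language="zh", limit=10):
    """Fetch (dataset id, likes) pairs. Needs network access."""
    url = build_query_url(language, limit)
    with urllib.request.urlopen(url, timeout=30) as resp:
        items = json.load(resp)
    return [(item.get("id"), item.get("likes")) for item in items]


# Example (needs network access):
# for name, likes in top_datasets():
#     print(likes, name)
```

A large, general Chinese text corpus near the top of that list would be the obvious candidate to try for distillation, though nothing here tells you which corpus Google actually used.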

By the way, here is one way to look for them: if you check the people who liked a Chinese dataset (you can actually see that list), there's a pretty good chance they are Chinese speakers. Well, not me.
