BERT for Word Segmentation

Hey there!

I’m currently comparing different ways of word segmentation.
I was wondering if I could simply fine-tune a pre-trained BERT with a classification layer on top, so that, given an expression, it gets decomposed into its base words by labeling each character with b (beginning), m (middle), or e (end):
thesunflower123 (WordPiece-tokenized as ['the', '##sun', '##flower', '##12', '##3']) → target labels bmebmmmmmmmebme (i.e. the | sunflower | 123)
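For reference, the labeling scheme I have in mind can be sketched in plain Python. The 's' tag for single-character words is my own addition, borrowed from the common BMES scheme, since pure b/m/e can't represent a one-character word:

```python
def bme_encode(words):
    """Turn a word segmentation into per-character b/m/e labels."""
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("s")  # single-char word gets its own tag (BMES-style assumption)
        else:
            labels.append("b" + "m" * (len(w) - 2) + "e")
    return "".join(labels)

def bme_decode(text, labels):
    """Recover the word list from the text plus predicted labels."""
    words, start = [], 0
    for i, tag in enumerate(labels):
        if tag in ("e", "s"):  # a word ends at this character
            words.append(text[start : i + 1])
            start = i + 1
    return words

print(bme_encode(["the", "sunflower", "123"]))        # bmebmmmmmmmebme
print(bme_decode("thesunflower123", "bmebmmmmmmmebme"))  # ['the', 'sunflower', '123']
```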

I found some papers on Chinese word segmentation and tried to adapt some tutorials, but I'm not sure which pre-trained model to start from or how to train it properly (on full sentences?).

I'd be thankful for any tips!