I’d like to train a model on my custom dataset (I don’t want to use a pretrained tokenizer or model). As I understand it, I should do the following:
- Select the desired network architecture, for example `ibert`
- Get the corresponding config: `config = AutoConfig.for_model('ibert')`
- Create the model from the config: `model = AutoModel.from_config(config)`
- Create tokenizer
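
In code, my understanding of the model-related steps looks roughly like the sketch below. The small sizes (`hidden_size`, `num_hidden_layers`, and so on) and the `vocab_size` of 1000 are placeholder values I picked to keep the example light, not recommendations; I assume `vocab_size` would have to match whatever tokenizer I end up training.

```python
from transformers import AutoConfig, AutoModel

# Build an I-BERT config locally; nothing is downloaded from the Hub.
# All sizes below are placeholder values chosen only to keep the model tiny.
config = AutoConfig.for_model(
    "ibert",
    vocab_size=1000,        # assumption: should equal the tokenizer's vocabulary size
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
)

# from_config only builds the architecture; the weights are randomly initialised.
model = AutoModel.from_config(config)

num_params = sum(p.numel() for p in model.parameters())
print(type(model).__name__, num_params)
```

As far as I can tell, `from_config` never loads pretrained weights, so this should give a model that is ready to be trained from scratch.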
Here is my first question: can I create and train any tokenizer from the `tokenizers` package when I train my own model from scratch?
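
To make the question concrete, this is the kind of thing I have in mind: a minimal sketch that trains a WordPiece tokenizer with the `tokenizers` package. The toy `corpus`, the `vocab_size` of 200, and the BERT-style special tokens are my own placeholder choices; whether WordPiece is a sensible pairing for the chosen architecture is exactly what I’m unsure about.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Toy corpus standing in for my real dataset (placeholder text).
corpus = [
    "the cell membrane regulates transport of ions",
    "protein folding depends on the amino acid sequence",
    "ion transport across the membrane requires energy",
]

# BERT-style special tokens; an assumption, not something the model requires.
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# Untrained WordPiece model with BERT-like normalisation and pre-tokenisation.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

# Learn the vocabulary from my own text; vocab_size is an upper bound.
trainer = trainers.WordPieceTrainer(vocab_size=200, special_tokens=special_tokens)
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("membrane transport")
print(encoding.tokens)
print(tokenizer.get_vocab_size())
```

I assume the resulting vocabulary size (`tokenizer.get_vocab_size()`) is what I would then pass as `vocab_size` in the model config.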
My second question: if I’d like to take the tokenizer config from an existing model, for example `allenai/scibert_scivocab_uncased`, how can I do that? I don’t need the pretrained tokenizer itself; I’d like to train it on my own dataset from scratch.