I'm just curious: I want to train multiple models on the same dataloader, similar to how it's done in vision. Is there any way to train a new tokenizer that isn't specific to a single model, so that I can run the following workflow?
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM

dataset = load_dataset('wiki')
model1 = AutoModelForMaskedLM.from_pretrained("model1name")
model2 = AutoModelForMaskedLM.from_pretrained("model2name")
### Help me with code to tokenize the dataset here
tokenizer = (...)  # I would have done a from_pretrained here, but I'm not sure what to do, since model1 and model2 might have different tokenizers
def tokenize_function(examples):
    # tokenizer(...) returns a dict of lists, which map expects with batched=True;
    # tokenizer.encode(...) only returns a flat list of ids for a single sequence
    return tokenizer(examples["text"], padding="max_length", truncation=True)
dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
dataset.set_format("torch")  # so the DataLoader yields tensors, not lists of Python ints
dataloader = DataLoader(dataset)
####
for batch in dataloader:
    y1 = model1(**batch)  # unpack input_ids, attention_mask, etc. as keyword args
    y2 = model2(**batch)
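
For what it's worth, here is the direction I was considering for the block between the ### markers: train a fresh BPE tokenizer on the corpus with the tokenizers library, wrap it in PreTrainedTokenizerFast, and resize each model's embeddings to match. This is just a sketch (the vocab size, special tokens, and max length are placeholders I made up), and I'm not sure it's the right approach:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Train a model-agnostic BPE vocabulary directly on the raw text
tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=30_000,  # placeholder; would need tuning
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tok.train_from_iterator((ex["text"] for ex in dataset), trainer=trainer)

# Wrap it so it exposes the usual transformers tokenizer API
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tok,
    model_max_length=512,  # placeholder; needed for padding="max_length"
    unk_token="[UNK]", cls_token="[CLS]", sep_token="[SEP]",
    pad_token="[PAD]", mask_token="[MASK]",
)

# Each model's embedding matrix has to match the new vocabulary size
model1.resize_token_embeddings(len(tokenizer))
model2.resize_token_embeddings(len(tokenizer))

One thing that worries me: since the new ids don't line up with either model's original vocabulary, resize_token_embeddings effectively throws away the pretrained token embeddings, so I suspect this only makes sense if I'm training both models from scratch (or heavily fine-tuning them) anyway.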