Train a large transformer with Custom Tokenizer/Data

I’d like to train a large transformer model with my own data and tokenizer. I have billions of examples, so I’d like to avoid anything pre-trained; I just want to train from scratch. Ideally I’d like to use tf/keras (I’m hoping to upload the data to Google Cloud and train with TPUs there, since I already have TPUs).

What’s the easiest way to do this?

I can think of a couple of options:

  1. I dump a dataset where each document is just a sequence of integers, and the model doesn’t care about tokenization at all; it just learns to predict the next integer given a list of integers. Is there any TF code where I can do this easily? (I expect I’ll have a few thousand documents, each with roughly a million integers/tokens.) See the first sketch below.

  2. I implement my own tokenizer, and the model (or its input pipeline) lets me plug that tokenizer in. Is there any TF code where I can do this easily? See the second sketch below.
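
For option 1, here’s a minimal sketch of how this could look in TF/Keras: a tf.data pipeline that windows pre-tokenized integer sequences into (input, target) pairs shifted by one token, plus a small decoder-only transformer trained with sparse categorical cross-entropy. All sizes (`VOCAB_SIZE`, `SEQ_LEN`, width/depth) and the `TinyGPT`/`make_dataset` names are placeholders I made up; `use_causal_mask` on `MultiHeadAttention` needs TF >= 2.10.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 32_000   # assumption: size of your integer vocabulary
SEQ_LEN = 512         # assumption: training context length
D_MODEL = 512
NUM_HEADS = 8
NUM_LAYERS = 6

def make_dataset(token_docs, batch_size=32):
    """token_docs: iterable of 1-D int sequences (one per document).
    Concatenates them, chops into SEQ_LEN+1 windows, and yields
    (inputs, targets) pairs shifted by one token."""
    flat = tf.concat([tf.convert_to_tensor(d, dtype=tf.int32) for d in token_docs], axis=0)
    ds = tf.data.Dataset.from_tensor_slices(flat)
    ds = ds.batch(SEQ_LEN + 1, drop_remainder=True)
    ds = ds.map(lambda w: (w[:-1], w[1:]), num_parallel_calls=tf.data.AUTOTUNE)
    return ds.shuffle(10_000).batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

class TransformerBlock(layers.Layer):
    def __init__(self, d_model, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
        self.ffn = keras.Sequential(
            [layers.Dense(4 * d_model, activation="gelu"), layers.Dense(d_model)])
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()

    def call(self, x):
        h = self.norm1(x)
        # use_causal_mask requires TF >= 2.10; on older versions build the
        # lower-triangular mask yourself with tf.linalg.band_part.
        x = x + self.attn(h, h, use_causal_mask=True)
        return x + self.ffn(self.norm2(x))

class TinyGPT(keras.Model):
    """Minimal decoder-only model: next-integer prediction over raw ids."""
    def __init__(self):
        super().__init__()
        self.tok_emb = layers.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = layers.Embedding(SEQ_LEN, D_MODEL)
        self.blocks = [TransformerBlock(D_MODEL, NUM_HEADS) for _ in range(NUM_LAYERS)]
        self.norm = layers.LayerNormalization()
        self.head = layers.Dense(VOCAB_SIZE)  # logits over the integer vocabulary

    def call(self, ids):
        positions = tf.range(tf.shape(ids)[1])
        h = self.tok_emb(ids) + self.pos_emb(positions)
        for block in self.blocks:
            h = block(h)
        return self.head(self.norm(h))

# On a Cloud TPU you would build/compile inside a TPUStrategy scope, e.g.:
#   resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
#   tf.config.experimental_connect_to_cluster(resolver)
#   tf.tpu.experimental.initialize_tpu_system(resolver)
#   strategy = tf.distribute.TPUStrategy(resolver)
#   with strategy.scope(): ...
model = TinyGPT()
model.compile(optimizer=keras.optimizers.Adam(3e-4),
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# model.fit(make_dataset(my_token_docs), epochs=1)   # my_token_docs: your int sequences
```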
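
For option 2, note that the Keras model itself only ever sees integer ids; the tokenizer normally lives in the tf.data pipeline (or runs offline). Here’s a rough sketch, assuming a purely hypothetical Python tokenizer (`MyTokenizer` with an `encode` method), wrapped with `tf.py_function` so the same windowing pipeline as above can consume raw text. Since `tf.py_function` keeps the tokenizer in Python, it is slow and doesn’t fit every TPU input setup, so in practice many people tokenize offline and write TFRecords of integer ids instead.

```python
import tensorflow as tf

class MyTokenizer:
    """Hypothetical custom tokenizer: replace encode() with your real logic
    (BPE, unigram, domain-specific rules, ...)."""
    def __init__(self, vocab):
        self.vocab = vocab
    def encode(self, text):
        return [self.vocab.get(tok, 0) for tok in text.split()]

tokenizer = MyTokenizer(vocab={"hello": 1, "world": 2})  # toy vocab for illustration

def _encode_py(text_tensor):
    # Runs eagerly inside tf.py_function: bytes -> list of ids -> int32 tensor.
    ids = tokenizer.encode(text_tensor.numpy().decode("utf-8"))
    return tf.constant(ids, dtype=tf.int32)

def encode(text_tensor):
    ids = tf.py_function(_encode_py, inp=[text_tensor], Tout=tf.int32)
    ids.set_shape([None])  # variable-length 1-D sequence of token ids
    return ids

raw_docs = tf.data.Dataset.from_tensor_slices(["hello world", "world hello hello"])
token_docs = raw_docs.map(encode, num_parallel_calls=tf.data.AUTOTUNE)
# From here the integer sequences can be windowed and batched the same way
# as in the make_dataset() sketch for option 1.
```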