How to train a language model from scratch when my dataset is bigger than RAM?

I intend to train a language model from scratch and have been following the tutorial on Hugging Face’s blog. However, when I ran their Google Colab code, I got a memory-exceeded error. The Colab code uses LineByLineTextDataset, and after digging through the source code, it turns out that LineByLineTextDataset loads the entire dataset eagerly. No wonder I ran out of memory: my dataset is larger than my RAM capacity.

The article hints that if my dataset is bigger than my RAM, I could “opt to load and tokenize examples on the fly, rather than as a preprocessing step.” However, I’m not certain how to achieve this. I would be very grateful if someone could point me in the right direction, especially if I can still use transformers.Trainer.

Do you want to use TensorFlow or PyTorch for the training?

Hi @Barik Here are a few suggestions:

  1. You can modify LineByLineTextDataset to load one line at a time instead of reading the whole file into memory. Basically, your __getitem__ should return a single text example.
  2. At this line the examples are encoded into vectors. You can disable this and move the encoding into the collate_batch function of DataCollatorForLanguageModeling.
  3. The collate function will then receive a List[str] instead of a List[torch.Tensor], so take the list of text examples, encode them, and then do the masking.

I think this will slow down the training, but you can try. Hope this helps.

Assuming you are using PyTorch :smile:
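The three steps above could be sketched roughly like this (untested, stdlib-only; the class name and the byte-offset index are my own illustration, not the actual LineByLineTextDataset code, and in practice you would subclass torch.utils.data.Dataset and do the tokenizing plus masking in the collator):

```python
import os
import tempfile


class LazyLineDataset:
    """Step 1: return one raw line per index instead of preloading the file.

    An index of byte offsets is built once (a single pass over the file,
    storing only integers), so __getitem__ can seek straight to line i.
    In real use this would subclass torch.utils.data.Dataset; it is kept
    dependency-free here.
    """

    def __init__(self, file_path):
        self.file_path = file_path
        self.offsets = [0]
        with open(file_path, "rb") as f:
            for _ in f:
                self.offsets.append(f.tell())
        self.offsets.pop()  # drop the final offset, which points at EOF

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, i):
        # Steps 2-3: return raw text; encoding/masking happens later,
        # in the collate function, one batch at a time.
        with open(self.file_path, "rb") as f:
            f.seek(self.offsets[i])
            return f.readline().decode("utf-8").rstrip("\n")


# tiny demo on a temporary three-line file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

ds = LazyLineDataset(path)
n, example = len(ds), ds[1]
os.unlink(path)
```

The collator side would then take the List[str] batch, encode it with the tokenizer, and apply the masking, as in step 3 above.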


I’m interested in PyTorch


To be honest, I am not certain. The blog post hints that we do not need TensorFlow, and I successfully reproduced the code after uninstalling TensorFlow in Google Colab (hinting that we’re using PyTorch). However, when I tried to reproduce the code on Kaggle, it only worked if I did not uninstall TensorFlow. Maybe that’s because the Kaggle version was reproduced after HF 3.0 rolled out, and something internal has changed that now requires TensorFlow.

That’s a helpful pointer. I do have a question: do we know how the dataset’s __getitem__ is normally accessed by the Trainer? If it is accessed incrementally from index 0 to the last index, then maybe I can cache 1 million sentences at a time. But if it is accessed randomly, the operation will be very expensive I/O-wise, since I would have to constantly move the file pointer.
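One way to check the access pattern empirically is to wrap a dataset and record the indices it is asked for (a stdlib-only sketch; the wrapper name is illustrative, not part of any library). For what it’s worth, a training DataLoader typically shuffles by default, so random access is the case to plan for:

```python
import random


class AccessLogger:
    """Wraps any map-style dataset and records the order in which
    __getitem__ is called, to see whether a sampler reads sequentially
    or randomly."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.accessed = []

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        self.accessed.append(i)
        return self.dataset[i]


logged = AccessLogger(["a", "b", "c", "d"])
# Simulate a shuffling sampler, which is what a training DataLoader
# usually uses for the train split.
for i in random.sample(range(len(logged)), len(logged)):
    _ = logged[i]
```

Every index is visited exactly once per epoch, just in shuffled order, which is why a seek-based reader (rather than a sequential cache) is the safer design.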

@Barik so the easiest way to overcome all the RAM limitations will likely be to use the new :hugs:nlp library, which is specifically developed for that.

Jared, a community contributor, is actually adding a loader for text datasets right now, which is probably exactly what you’ll need. You can follow the PR here:

I’ll ping him as well (he’s not yet on the forum I think).


Ran into the same issue as you: TF datasets load everything greedily by default, and that can cause performance issues if you’re not careful. I recently opened a PR to the huggingface/nlp library which maps a .txt file into sharded Apache Arrow files, which can then be read lazily from disk. So after everything gets merged, you could do something like:

import tensorflow as tf
from nlp import load_dataset

dset = load_dataset("text", data_files="/path/to/file.txt")["train"]
dset.set_format("tensorflow", columns=["text"])

def dataset_gen():
    for ex in dset:
        yield ex

tf_dataset = tf.data.Dataset.from_generator(dataset_gen, output_types={"text": tf.string})

I haven’t tested this exact code, but you get the gist. It should be roughly equivalent to LineByLineTextDataset. Follow the PR for more info about dataset efficiency.

To add to @jaredtnielsen’s answer: I think you shouldn’t have any memory issues if you are using PyTorch + :hugs:nlp. Loading will be fully lazy from the drive, with almost nothing in RAM (just your current batch, plus the model of course).
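A minimal sketch of how that could look on the PyTorch side, assuming the text loader from the PR above lands (the DataLoader wiring is commented out since it needs nlp and torch installed; the collate function below is illustrative and leaves the tokenizer step out):

```python
# The real wiring would look roughly like this (untested):
#
#   from nlp import load_dataset
#   from torch.utils.data import DataLoader
#
#   dset = load_dataset("text", data_files="/path/to/file.txt")["train"]
#   loader = DataLoader(dset, batch_size=32, shuffle=True, collate_fn=collate)

def collate(batch):
    """Each example arrives as {"text": str}; pull out the raw strings so a
    tokenizer can encode one batch at a time, keeping only the current
    batch of text in RAM."""
    return [example["text"] for example in batch]


# stand-in batch shaped like what the DataLoader would deliver
demo = collate([{"text": "hola"}, {"text": "mundo"}])
```

From there, the collate function would hand the list of strings to the tokenizer and a masking collator, as in the earlier suggestions.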


Good to know!

I am going to train a RoBERTa like model on almost 200 GB of Spanish corpus. I will use PyTorch + :hugs:nlp. Wish me luck! :wink: