Tokenizer taking a lot of memory

I am using BertTokenizerFast from the transformers library to encode my text data, with the pre-trained model "bert-base-uncased". My dataset has 7 million rows, which is around 2 GB, and I have 64 GB of RAM.

When I try to convert the encoded text into vectors using BertModel with the same pre-trained model, "bert-base-uncased", it cannot handle even 10,000 encoded inputs; the allocation fails, reporting that hundreds of GB of memory would be needed.

I am using the reference code from this reference blog.

You are probably trying to process all of the data at once, which of course will not work. Try chunking your data into batches and processing them one by one.
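This isn't the original poster's code, just a minimal sketch of what batch-wise processing could look like with the classes mentioned above. The `texts` list, the batch size, and `max_length` are placeholder assumptions you would adapt to your data.

```python
# A minimal sketch of batched encoding; `texts` is assumed to be a plain
# Python list of strings. Batch size and max_length are arbitrary choices.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

all_embeddings = []
batch_size = 256  # tune to the memory you have available

with torch.no_grad():
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        enc = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=128,
            return_tensors="pt",
        )
        out = model(**enc)
        # Keep only the [CLS] vector per sentence to limit memory,
        # and move it to CPU before storing.
        all_embeddings.append(out.last_hidden_state[:, 0, :].cpu())

embeddings = torch.cat(all_embeddings)  # shape: (num_texts, 768)
```

The key point is that only one batch of activations lives in memory at a time; storing just the pooled [CLS] vectors (768 floats per row) instead of the full token-level outputs is what keeps 7 million rows tractable.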


@BramVanroy Thank you. It worked for me.

What code did you use? I am using the TrainDataset function.