Tokenizer taking a lot of memory

I am using BertTokenizerFast from the transformers library to encode my text data, with the pre-trained model "bert-base-uncased". My dataset has 7 million rows, which is around 2 GB, and I have 64 GB of RAM.

When I try to convert the encoded text into vectors using BertModel with the same pre-trained model, "bert-base-uncased", it cannot handle even 10,000 encoded inputs; the allocation fails, reporting that hundreds of GB of memory would be needed.

I am using the reference code from this reference blog.

You are probably trying to process all of the data at once, which of course will not work. Try chunking your data into batches and processing them one by one.
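This isn't the original poster's code, just a minimal sketch of what batch-wise processing could look like with the classes mentioned above. The `texts` list, the batch size, and `max_length` are placeholder assumptions you would adapt to your data.

```python
# A minimal sketch of batched encoding; `texts` is assumed to be a plain
# Python list of strings. Batch size and max_length are arbitrary choices.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

all_embeddings = []
batch_size = 256  # tune to the memory you have available

with torch.no_grad():
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        enc = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=128,
            return_tensors="pt",
        )
        out = model(**enc)
        # Keep only the [CLS] vector per sentence to limit memory,
        # and move it to CPU before storing.
        all_embeddings.append(out.last_hidden_state[:, 0, :].cpu())

embeddings = torch.cat(all_embeddings)  # shape: (num_texts, 768)
```

The key point is that only one batch of activations lives in memory at a time; storing just the pooled [CLS] vectors (768 floats per row) instead of the full token-level outputs is what keeps 7 million rows tractable.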


@BramVanroy Thank you. It worked for me.

What code did you use? I am using the TrainDataset function.