Tokenizer.batch_encode_plus uses all my RAM

I only have 25 GB of RAM, and every time I try to run the code below my Google Colab session crashes. Any idea how to prevent this from happening? Would doing it batch-wise work? If so, what would that look like?

max_q_len = 128
max_a_len = 64

def batch_encode(text, max_seq_len):
  return tokenizer.batch_encode_plus(
      text.tolist(),
      max_length = max_seq_len,
      pad_to_max_length=True,
      truncation=True,
      return_token_type_ids=False
  )

# tokenize and encode sequences in the training set
tokensq_train = batch_encode(train_q, max_q_len)
tokens1_train = batch_encode(train_a1, max_a_len)
tokens2_train = batch_encode(train_a2, max_a_len)

My Tokenizer:

tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')

len(train_q) is 5023194 (which is the same for train_a1 and train_a2)

Are you positive it’s actually the encoding that does it and not some other part of your code? Maybe you can show us the traceback?

@neuralpat Yes. It works with a smaller dataset. Unfortunately, there is no traceback other than “Your session crashed after using all available RAM”. I am using Google Colab.

I also tried to tokenize and encode only train_q without train_a1 and train_a2 - still crashed.

I then tried this:

    trainq_list = train_q.tolist()    
    batch_size = 50000
    def batch_encode(text, max_seq_len):
      for i in range(0, len(trainq_list), batch_size):
        encoded_sent = tokenizer.batch_encode_plus(
            text,
            max_length = max_seq_len,
            pad_to_max_length=True,
            truncation=True,
            return_token_type_ids=False
        )
      return encoded_sent

    # tokenize and encode sequences in the training set
    tokensq_train = batch_encode(train_q, max_q_len)

So the idea was to go through it in batches of 50,000 in the hope of not crashing, but that didn’t work… it still crashed. Any idea how I could tackle this problem?

Just because it works with a smaller dataset doesn’t mean it’s the tokenization that’s causing the RAM issues.
You could try streaming the data from disk instead of loading it all into RAM at once.
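
For a sense of scale: batch_encode_plus returns plain Python lists of ints, and 5,023,194 questions padded to 128 tokens is roughly 640 million token ids, which as Python int objects is on the order of 20 GB for the question input_ids alone, before the attention masks and the two answer sets. So it is quite plausible that materializing everything at once blows past 25 GB. Here is a rough sketch of the streaming idea, assuming (hypothetically) that the questions live in a train.csv file with a column q, and writing each encoded chunk straight to disk as a compact int32 buffer instead of keeping it in memory:

    import numpy as np
    import pandas as pd
    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
    max_q_len = 128

    # Read the CSV in chunks so only one chunk is ever in RAM,
    # and append each encoded chunk to a raw int32 file on disk.
    with open('train_q_ids.bin', 'wb') as out:
        for chunk in pd.read_csv('train.csv', usecols=['q'], chunksize=50_000):
            enc = tokenizer(
                chunk['q'].astype(str).tolist(),
                max_length=max_q_len,
                padding='max_length',
                truncation=True,
                return_token_type_ids=False,
            )
            np.asarray(enc['input_ids'], dtype=np.int32).tofile(out)

    # Later, memory-map the ids instead of loading the whole file:
    q_ids = np.memmap('train_q_ids.bin', dtype=np.int32, mode='r').reshape(-1, max_q_len)

Each question then costs 128 × 4 bytes on disk instead of hundreds of bytes of Python objects in RAM.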

Try this:

def batch_encode(text, max_seq_len):
    for i in range(0, len(text), batch_size):
        # tokenize only the current slice, not the whole dataset
        encoded_sent = tokenizer.batch_encode_plus(
            text[i : i + batch_size].tolist(),
            max_length=max_seq_len,
            add_special_tokens=True,
            padding="longest",
            return_attention_mask=True,
            truncation=True,
            return_tensors="pt",
        )

        input_ids_train = encoded_sent["input_ids"].to(device)
        attention_masks_train = encoded_sent["attention_mask"].to(device)
        output = model(input_ids_train, attention_masks_train)

Your problem is that you are passing all the text to batch_encode_plus on every iteration of the loop, instead of just the current slice text[i : i + batch_size].
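
If the goal is only to pre-tokenize train_q / train_a1 / train_a2 for later use (rather than feed the model right away), the same fix can be applied to your original loop. A sketch, using a hypothetical helper name encode_in_batches, that converts each encoded slice to compact int32 NumPy arrays instead of keeping the huge Python lists that batch_encode_plus returns:

    import numpy as np

    batch_size = 50000

    def encode_in_batches(text, max_seq_len):
        # hypothetical helper: tokenize one slice at a time and collect the
        # results as int32 arrays, which are far smaller than Python lists
        ids_chunks, mask_chunks = [], []
        for i in range(0, len(text), batch_size):
            enc = tokenizer.batch_encode_plus(
                text[i : i + batch_size].tolist(),
                max_length=max_seq_len,
                padding="max_length",
                truncation=True,
                return_token_type_ids=False,
            )
            ids_chunks.append(np.asarray(enc["input_ids"], dtype=np.int32))
            mask_chunks.append(np.asarray(enc["attention_mask"], dtype=np.int32))
        return np.concatenate(ids_chunks), np.concatenate(mask_chunks)

    q_ids, q_mask = encode_in_batches(train_q, max_q_len)

The question input_ids then take about 5,023,194 × 128 × 4 bytes ≈ 2.6 GB, and all six arrays together roughly 10 GB, which should fit in 25 GB.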
