map on OpenWebText consumes all RAM

Hi!
I’m currently trying to process OpenWebText by sentencizing it with spaCy (https://spacy.io/api/sentencizer).

When I run map, the RAM consumption of the launched Python processes grows steadily until my computer (a MacBook Pro with 16 GB RAM) pauses all applications (including the dataset processing) due to memory shortage. It crashes after running for about 2 hours (with approx. 2 hours left). Also, I can’t tell in advance whether the process will finish or eventually consume all RAM, so debugging is very time-consuming and I don’t really know what is happening.

I have tried setting datasets.config.IN_MEMORY_MAX_SIZE to something smaller, without success. Currently I’m running the process with batch_size=100 and writer_batch_size=100.

So my questions are: how should I deal with RAM issues when working with Hugging Face Datasets? Is setting writer_batch_size and batch_size to smaller values the golden rule? Is there some way to anticipate these problems in advance, without the time-consuming trial-and-error approach? And is there some way to monitor the map process to see whether it is running as desired?

Side note: all of my code runs without any issues if I choose a smaller subset of the OpenWebText corpus.

My code can be seen below:

import os

import datasets
from spacy.lang.en import English

dataset = datasets.load_dataset("openwebtext", cache_dir=args.cache_dir, split="train")

nlp = English()
nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "。"]})
chunked_dataset = dataset.map(
    lambda x: split_to_sentences(x, nlp),
    batched=True,
    batch_size=100,
    writer_batch_size=100,
    remove_columns=dataset.column_names,
    cache_file_name=os.path.join(CACHE_DIR, "mappings", "to_sentences.arrow"),
    num_proc=4,
    keep_in_memory=False,
)

And the function I’m using with map is:

def split_to_sentences(examples, sentencizer):
    return_examples = []

    for text in examples["text"]:
        # Split each document into lines, then sentencize each line.
        for text_part in text.split("\n"):
            for sentence in sentencizer(text_part.strip()).sents:
                sentence = str(sentence).strip()
                # Keep only sentences containing more than one word.
                if sentence.count(" ") > 0:
                    return_examples.append(sentence)

    return {"text": return_examples}

Hi!

Is setting the writer batch size and batch size to smaller values the golden rule?

Yes, always try to reduce batch_size/writer_batch_size first.
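The reason this helps can be illustrated with a toy sketch (pure Python, not the actual datasets internals): rows are flushed out of the buffer every writer_batch_size items instead of accumulating, so peak memory is bounded by the batch size rather than by the dataset size:

```python
# Toy sketch (not the real datasets implementation): a smaller write batch
# bounds peak memory because the buffer is flushed every `batch_size` rows.
def process_in_batches(rows, batch_size, flush):
    buffer = []
    for row in rows:
        buffer.append(row.upper())  # stand-in for the real transform
        if len(buffer) >= batch_size:
            flush(buffer)           # in datasets: write an Arrow batch to disk
            buffer = []             # peak memory stays at ~batch_size rows
    if buffer:
        flush(buffer)               # flush the final partial batch

written = []
process_in_batches(["one", "two", "three"], 2, written.extend)
# written is now ["ONE", "TWO", "THREE"], flushed in batches of 2, then 1
```

With a large writer_batch_size, the buffer in each worker process holds that many processed rows before anything is written to the Arrow cache file, which multiplies across num_proc workers.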

Also, is there some way to monitor the map process to see whether it seems to run as desired?

What do you mean by “run as desired”? You can use the psutil or memory_profiler packages to track process stats.
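As a minimal, standard-library-only sketch (psutil gives richer per-process stats, and resource is Unix-only, which is fine on a MacBook), you could log peak memory from inside the mapped function to watch each worker’s usage grow:

```python
import os
import resource

def log_peak_memory(tag=""):
    # Peak resident set size of the current process; the unit is
    # platform-dependent (kilobytes on Linux, bytes on macOS).
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[pid {os.getpid()}] {tag} peak RSS: {peak}")
    return peak

log_peak_memory("after batch")
```

Calling this at the top of your batched function (the tag is just a hypothetical label) prints one line per batch per worker, so you can see early on whether memory keeps climbing or levels off.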