map on OpenWebText consumes all RAM

Hi!
I’m currently trying to process OpenWebText by sentencizing it with spaCy (https://spacy.io/api/sentencizer).

When I run map, the RAM consumption of the launched Python processes grows steadily until my computer (a MacBook Pro with 16 GB RAM) pauses all applications (including the dataset processing) due to memory shortage. It crashes after running for about 2 hours (with approx. 2 hours left). Also, I can’t tell in advance whether the process will finish or eventually consume all RAM, so debugging is very time-consuming and I don’t really know what is happening.

I have tried setting datasets.config.IN_MEMORY_MAX_SIZE to something smaller, without success. Currently I’m running the process with batch_size=100 and writer_batch_size=100.

So my questions are: how should I deal with RAM issues when working with Hugging Face Datasets? Is setting writer_batch_size and batch_size to smaller values the golden rule? Is there some way to anticipate these problems in advance, without the time-consuming trial-and-error approach? And is there some way to monitor the map process to see whether it is running as desired?

Side note: all of my code runs without any issues if I choose a smaller subset of the OpenWebText corpus.

My code can be seen below:

import os

import datasets
from spacy.lang.en import English

dataset = datasets.load_dataset("openwebtext", cache_dir=args.cache_dir, split="train")

nlp = English()
nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "。"]})
chunked_dataset = dataset.map(
    lambda x: split_to_sentences(x, nlp),
    batched=True,
    batch_size=100,
    writer_batch_size=100,
    remove_columns=dataset.column_names,
    cache_file_name=os.path.join(CACHE_DIR, "mappings", "to_sentences.arrow"),
    num_proc=4,
    keep_in_memory=False,
)

And the function I’m using with map is:

def split_to_sentences(examples, sentencizer):
    return_examples = []

    for text in examples["text"]:
        # Split each document into lines, then sentencize each line.
        for text_part in text.split("\n"):
            for sentence in sentencizer(text_part.strip()).sents:
                sentence = str(sentence).strip()
                # Keep only sentences containing more than one word.
                if sentence.count(" ") > 0:
                    return_examples.append(sentence)

    return {"text": return_examples}

Hi!

Is setting the writer batch size and batch size to smaller values the golden rule?

Yes, always try to reduce batch_size/writer_batch_size first.
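The reason this helps can be illustrated with a toy sketch (pure Python, not the actual datasets internals): rows are flushed out of the buffer every writer_batch_size items instead of accumulating, so peak memory is bounded by the batch size rather than by the dataset size:

```python
# Toy sketch (not the real datasets implementation): a smaller write batch
# bounds peak memory because the buffer is flushed every `batch_size` rows.
def process_in_batches(rows, batch_size, flush):
    buffer = []
    for row in rows:
        buffer.append(row.upper())  # stand-in for the real transform
        if len(buffer) >= batch_size:
            flush(buffer)           # in datasets: write an Arrow batch to disk
            buffer = []             # peak memory stays at ~batch_size rows
    if buffer:
        flush(buffer)               # flush the final partial batch

written = []
process_in_batches(["one", "two", "three"], 2, written.extend)
# written is now ["ONE", "TWO", "THREE"], flushed in batches of 2, then 1
```

With a large writer_batch_size, the buffer in each worker process holds that many processed rows before anything is written to the Arrow cache file, which multiplies across num_proc workers.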

Also, is there some way to monitor the map process to see whether it seems to run as desired?

What do you mean by “run as desired”? You can use the psutil or memory_profiler packages to track process stats.
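As a minimal, standard-library-only sketch (psutil gives richer per-process stats, and resource is Unix-only, which is fine on a MacBook), you could log peak memory from inside the mapped function to watch each worker’s usage grow:

```python
import os
import resource

def log_peak_memory(tag=""):
    # Peak resident set size of the current process; the unit is
    # platform-dependent (kilobytes on Linux, bytes on macOS).
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"[pid {os.getpid()}] {tag} peak RSS: {peak}")
    return peak

log_peak_memory("after batch")
```

Calling this at the top of your batched function (the tag is just a hypothetical label) prints one line per batch per worker, so you can see early on whether memory keeps climbing or levels off.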