Let's use this issue to gather instructions on how to profile one's CPU<->NVMe setup.
(@tjruwase and I have been editing this post)
You need to do this on every new CPU/NVMe setup in order to configure the [aio](https://www.deepspeed.ai/docs/config-json/#asynchronous-io) param section.
The following NVMe benchmark measures the end-to-end performance of how fast data can be read/written between CPU and NVMe, so make sure to run it on the actual system that you intend to use.
For this demonstration we are going to use:
1. XPG Gammix s11 pro 2tb NVMe drive
2. Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz setup.
## 1. Preparation
```
cd /somewhere/on/nvme/drive/you/want/to/test
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
```
You may also have to install `libaio-dev` if the DeepSpeed NVMe driver fails to build. On Ubuntu it's just:
```
apt install libaio-dev
```
Depending on the speed of your NVMe, each benchmark could run for 30min or longer.
Important: make sure you're not doing any other I/O on the device you're testing or you will get incorrect results.
## 2. Run Read Benchmark
```
cd csrc/aio/py_test
dd if=/dev/urandom of=input.file count=400 bs=1M
mkdir read-logs
./run_read_sweep.sh input.file read-logs
python parse_aio_stats.py --logdir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -1
```
This benchmark assumes the current working directory is on the NVMe drive. If it's not, copy the `csrc/aio/py_test` folder to your NVMe drive and run the test there.
You can, of course, use it to test non-NVMe drives (e.g. SSD).
The tail of the list should show the fastest speeds.
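If the `sort -k9` / `sort -k10` field positions don't line up on your system (the write lines carry one extra column, hence the different sort key), here is a small hedged Python helper, not part of DeepSpeed, that ranks the `parse_aio_stats.py` output by splitting on the `=` sign instead. The script name and the assumed line format are my own.
```
# rank_results.py - hypothetical helper, not part of DeepSpeed.
# Assumes each line printed by parse_aio_stats.py looks like:
#   ('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208
import sys

def rank(lines, top_n=10):
    results = []
    for line in lines:
        if '=' not in line:
            continue
        config, speed = line.strip().rstrip(',').rsplit('=', 1)
        results.append((float(speed), config.strip()))
    # highest throughput (GB/s) last, mirroring `sort -n | tail`
    for speed, config in sorted(results)[-top_n:]:
        print(f'{config} = {speed}')

if __name__ == '__main__':
    rank(sys.stdin.readlines())
```
Usage (same idea for `write_speed`): `python parse_aio_stats.py --logdir read-logs/aio_perf_sweep --metric read_speed | python rank_results.py`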
Here is the best result for the read benchmark:
```
('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208
```
## 3. Run Write Benchmark
```
# cd csrc/aio/py_test
mkdir write-test-data
mkdir write-logs
./run_write_sweep.sh 400 write-test-data write-logs
python parse_aio_stats.py --logdir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -1
```
Here is the best result for the write benchmark:
```
('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324
```
## 4. Contribute your data
We need more read/write data for various devices to figure out how to make the configuration process automated.
If you're contributing your data, please post:
1. Your NVMe device name/size
2. Advertised max read/write spec (google: "device name spec")
3. The last 10 lines of the sorted results, i.e.:
```
python parse_aio_stats.py --logdir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -10
python parse_aio_stats.py --logdir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -10
```
**Important**: please make sure not to do any other I/O on the device under benchmark.
## 5. Derive the `aio` params block
Now we need to figure out how to use the results of the benchmark to configure `aio`.
Here is the final result:
```
"aio": {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}
```
Most of this config block's values come from the benchmark's best results for read and write - i.e. the configuration that gives us the highest GB/s throughput (the higher the number the better).
The schema of each results line is as follows:
- **read**: `read or write | single or block event completion | overlap or sequential event submission | # processes | intra-process parallelism | queue depth | block size | GB/sec`
- **write**: it's the same as read, plus the 2nd column is the size of the written data.
The best read config was:
```
('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208
```
which corresponds to `single_submit=false, overlap_events=true, queue_depth=32, block_size=262144`
`single_submit=true` if the 2nd column is `single` instead of `block`.
`overlap_events=false` if the 3rd column is `sequential` instead of `overlap`.
The best write config was:
```
('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324
```
which corresponds to: `single_submit=false, overlap_events=true, queue_depth=32, block_size=262144`
Unfortunately, users can't currently specify separate read and write configurations, so they need to combine the best of both. Fortunately, in this case, and in most cases, the best read and write configurations are the same or similar.
Reasonable defaults are hard to set because of device and system differences. On the setups we tested, `block_size=1M` consistently seemed optimal across two clusters, but on this particular setup `block_size=256K` seems to be optimal.
Finally, for the last remaining config value, `thread_count=1` is a reasonable default, since this is a per-rank configuration.
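For context, here is a minimal, hedged sketch of where the derived `aio` block might sit in a full DeepSpeed config used for ZeRO-3 NVMe offload. The surrounding keys, the batch size, and the `/local_nvme` path are illustrative placeholders on my part, not output of the benchmark; only the `aio` values come from the sweep above.
```
# build_ds_config.py - illustrative only; adapt paths and batch size.
import json

ds_config = {
    "train_batch_size": 8,  # placeholder
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    # Values below come from the benchmark results above.
    "aio": {
        "block_size": 262144,
        "queue_depth": 32,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=4)
```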
TODO: this config generation can be automated, but we need to figure out what to do if the top read and write benchmarks don't agree.
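As a starting point for that automation, here is a hedged sketch that turns the best read and best write lines (as printed by the `sort | tail -1` commands above) into an `aio`-style dict and warns when they disagree. The "prefer the read config" tie-break is my own assumption, not a DeepSpeed policy.
```
# derive_aio_config.py - hypothetical automation sketch, not part of DeepSpeed.
def parse_result(line):
    """Turn a result line like
       ('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208
    into an aio-style dict. Write lines carry an extra data-size column."""
    config, _speed = line.strip().rstrip(',').rsplit('=', 1)
    fields = [f.strip(" '()") for f in config.split(',')]
    if fields[0] == 'write':
        del fields[1]  # drop the data-size column so read/write line up
    # fields: op | single/block | overlap/sequential | procs | parallelism |
    #         queue depth | block size
    return {
        'single_submit': fields[1] == 'single',
        'overlap_events': fields[2] == 'overlap',
        'queue_depth': int(fields[5]),
        'block_size': int(fields[6]),
        'thread_count': 1,  # per-rank default, as discussed above
    }

def derive_aio_config(best_read_line, best_write_line):
    read_cfg = parse_result(best_read_line)
    write_cfg = parse_result(best_write_line)
    if read_cfg != write_cfg:
        print('warning: best read and write configs disagree, using read:')
        print('  read :', read_cfg)
        print('  write:', write_cfg)
    return read_cfg

if __name__ == '__main__':
    print(derive_aio_config(
        "('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208",
        "('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324",
    ))
```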
-----------
Sample stats: for XPG Gammix s11 pro 2tb NVMe drive with published specs of:
- max read speed of up to 3500 MB/s
- max write speed of up to 3000 MB/s
The benchmark records throughput for ~400 different configuration combinations:
- read: between 1.0 and 3.17 GB/s
- write: between 1.2 and 2.59 GB/s

From these we can choose a single configuration that leads to the highest throughput for both read and write.
I tried my 860 Evo SSD and got ~0.5 GB/s read throughput, so about 6x slower.
-------------
TODO/Questions to @tjruwase:
- [ ] so we have a huge range of numbers - e.g. for read, 1 to 3 GB/s - I suppose this is the effective range depending on the kind of task, so both the low and the high should be considered - but how does this correlate to training? Which of the 400 data points are most relevant? That's too much data for a user to make sense of. Perhaps it should just report the min and max?
- [ ] what are good numbers, so that users will know whether their NVMe is fast enough? I'm thinking the numbers from the paper?