Let's use this issue to gather instructions on how to profile one's CPU<->NVMe setup.
(@tjruwase and I have been editing this post)
You need to do this on every new CPU/NVMe setup in order to configure the [aio](https://www.deepspeed.ai/docs/config-json/#asynchronous-io) param section.
The following NVMe benchmark measures the end-to-end performance of how fast data can be read/written between CPU and NVMe, so make sure to run it on the actual system that you intend to use.
For this demonstration we are going to use:
1. XPG Gammix s11 pro 2tb NVMe drive
2. Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz setup.
## 1. Preparation
```
cd /somewhere/on/nvme/drive/you/want/to/test
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
```
You may also have to install `libaio-dev` if the DeepSpeed NVMe driver fails to build. On Ubuntu it's just:
```
apt install libaio-dev
```
Depending on the speed of your NVMe, each benchmark could run for 30min or longer.
Important: make sure you're not doing any other I/O on the device you're testing or you will get incorrect results.
## 2. Run Read Benchmark
```
cd csrc/aio/py_test
dd if=/dev/urandom of=input.file count=400 bs=1M
mkdir read-logs
./run_read_sweep.sh input.file read-logs
python parse_aio_stats.py --logdir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -1
```
This benchmark assumes the current working directory is on the NVMe drive. If it's not, copy the `csrc/aio/py_test` folder to your NVMe drive and run the test there.
You can, of course, use it to test non-NVMe drives (e.g. SSD).
The tail of the list should show the fastest speeds.
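If the `sort -k9` / `sort -k10` field positions don't line up on your system (the write lines carry one extra column, hence the different sort key), here is a small hedged Python helper, not part of DeepSpeed, that ranks the `parse_aio_stats.py` output by splitting on the `=` sign instead. The script name and the assumed line format are my own.
```
# rank_results.py - hypothetical helper, not part of DeepSpeed.
# Assumes each line printed by parse_aio_stats.py looks like:
#   ('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208
import sys

def rank(lines, top_n=10):
    results = []
    for line in lines:
        if '=' not in line:
            continue
        config, speed = line.strip().rstrip(',').rsplit('=', 1)
        results.append((float(speed), config.strip()))
    # highest throughput (GB/s) last, mirroring `sort -n | tail`
    for speed, config in sorted(results)[-top_n:]:
        print(f'{config} = {speed}')

if __name__ == '__main__':
    rank(sys.stdin.readlines())
```
Usage (same idea for `write_speed`): `python parse_aio_stats.py --logdir read-logs/aio_perf_sweep --metric read_speed | python rank_results.py`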
Here is the best result for the read benchmark:
```
('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208
```
## 3. Run Write Benchmark
```
# cd csrc/aio/py_test
mkdir write-test-data
mkdir write-logs
./run_write_sweep.sh 400 write-test-data write-logs
python parse_aio_stats.py --logdir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -1
```
Here is the best result for the write benchmark:
```
('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324
```
## 4. Contribute your data
We need more read/write data for various devices to figure out how to make the configuration process automated.
If you're contributing your data, please post:
1. Your NVMe device name/size
2. Advertised max read/write spec (google: "device name spec")
3. The last 10 lines of the sorted results, i.e.:
```
python parse_aio_stats.py --logdir read-logs/aio_perf_sweep --metric read_speed | sort -k9 -n | tail -10
python parse_aio_stats.py --logdir write-logs/aio_perf_sweep --metric write_speed | sort -k10 -n | tail -10
```
**Important**: please make sure not to do any other I/O on the device under benchmark.
## 5. Derive the `aio` params block
Now we need to figure out how to use the results of the benchmark to configure `aio`.
Here is the final result:
```
"aio": {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
}
```
Most of this config block's values come from the benchmark's best results for read and write - i.e. the configuration that gives us the highest GB/s throughput (the higher the number the better).
The schema of each results line is as follows:
- **read**: `read or write | single or block event completion | overlap or sequential event submission | # processes | intra-process parallelism | queue depth | block size | GB/sec`
- **write**: it's the same as read, plus the 2nd column is the size of the written data.
The best read config was:
```
('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208
```
which corresponds to `single_submit=false, overlap_events=true, queue_depth=32, block_size=262144`
`single_submit=true` if the 2nd column is `single` instead of `block`.
`overlap_events=false` if the 3rd column is `sequential` instead of `overlap`.
The best write config was:
```
('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324
```
which corresponds to: `single_submit=false, overlap_events=true, queue_depth=32, block_size=262144`
Unfortunately, users can't currently specify separate read and write configurations, so they need to combine the best of both. Fortunately, in this case, and in most cases, the best read and write configurations are the same or similar.
Reasonable defaults are hard to set because of device and system differences. On the setups we tested, `block_size=1M` consistently seemed optimal across two clusters, but on this particular setup `block_size=256K` seems to be optimal.
Finally, for the last remaining config value, `thread_count=1` is a reasonable default, since this is a per-rank configuration.
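For context, here is a minimal, hedged sketch of where the derived `aio` block might sit in a full DeepSpeed config used for ZeRO-3 NVMe offload. The surrounding keys, the batch size, and the `/local_nvme` path are illustrative placeholders on my part, not output of the benchmark; only the `aio` values come from the sweep above.
```
# build_ds_config.py - illustrative only; adapt paths and batch size.
import json

ds_config = {
    "train_batch_size": 8,  # placeholder
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    # Values below come from the benchmark results above.
    "aio": {
        "block_size": 262144,
        "queue_depth": 32,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=4)
```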
TODO: this config generation can be automated, but we need to figure out what to do if the top read and write benchmarks don't agree.
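As a starting point for that automation, here is a hedged sketch that turns the best read and best write lines (as printed by the `sort | tail -1` commands above) into an `aio`-style dict and warns when they disagree. The "prefer the read config" tie-break is my own assumption, not a DeepSpeed policy.
```
# derive_aio_config.py - hypothetical automation sketch, not part of DeepSpeed.
def parse_result(line):
    """Turn a result line like
       ('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208
    into an aio-style dict. Write lines carry an extra data-size column."""
    config, _speed = line.strip().rstrip(',').rsplit('=', 1)
    fields = [f.strip(" '()") for f in config.split(',')]
    if fields[0] == 'write':
        del fields[1]  # drop the data-size column so read/write line up
    # fields: op | single/block | overlap/sequential | procs | parallelism |
    #         queue depth | block size
    return {
        'single_submit': fields[1] == 'single',
        'overlap_events': fields[2] == 'overlap',
        'queue_depth': int(fields[5]),
        'block_size': int(fields[6]),
        'thread_count': 1,  # per-rank default, as discussed above
    }

def derive_aio_config(best_read_line, best_write_line):
    read_cfg = parse_result(best_read_line)
    write_cfg = parse_result(best_write_line)
    if read_cfg != write_cfg:
        print('warning: best read and write configs disagree, using read:')
        print('  read :', read_cfg)
        print('  write:', write_cfg)
    return read_cfg

if __name__ == '__main__':
    print(derive_aio_config(
        "('read', 'block', 'overlap', 1, 1, 32, 262144) = 3.168102406435208",
        "('write', '400MB', 'block', 'overlap', 8, 1, 32, 262144) = 2.5923189261116324",
    ))
```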
-----------
Sample stats: for XPG Gammix s11 pro 2tb NVMe drive with published specs of:
- max read speed of up to 3500 MB/s
- max write speed of up to 3000 MB/s
The benchmark records throughput for ~400 different configuration combinations:
- read: between 1.0 and 3.17 GB/s
- write: between 1.2 and 2.59 GB/s

From these we can choose a single configuration that leads to the highest throughput for both read and write.
I tried my 860 Evo SSD and got ~0.5 GB/s read throughput, so about 6x slower.
-------------
TODO/Questions to @tjruwase:
- [ ] so we have a huge range of numbers - e.g. for read, 1 to 3 GB/s - I suppose this is the effective range depending on the kind of task, so both the low and the high should be considered - but how does this correlate to training? Which of the 400 data points are most relevant? That's too much data for a user to make sense of. Perhaps it should just report the min and max?
- [ ] what are good numbers, so that users will know whether their NVMe is fast enough? I'm thinking the numbers from the paper?