I am benchmarking a small server box based on the SuperMicro E300-8D. I've installed the latest CentOS 7.5 with the latest updates, 64GB of DDR4-2100 RAM, and a Samsung 970 EVO 1TB NVMe SSD. The OS is installed on a USB stick in the internal USB port, so the SSD is entirely unused except during my benchmarking.
The goal of my testing is to find an optimal concurrency level for this SSD, inspired by the benchmarking approach used by ScyllaDB. To that end I'm using diskplorer, which internally uses fio to explore the relationship between concurrency and both IOPS and latency. It produces handy graphs like the ones below. In all cases I'm using a 4K random read workload.
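For readers who want to reproduce a single point on such a curve without diskplorer, a 4K random read run at a fixed queue depth can be expressed as a plain fio command line. The sketch below only builds and prints the commands. The helper name `fio_randread_cmd`, the choice of the libaio engine, the 30 second runtime, and the list of depths are assumptions for illustration, not necessarily what diskplorer itself passes to fio.

```python
import shlex

def fio_randread_cmd(target, iodepth, runtime_s=30):
    """Build an fio command line for a 4K random read test (illustrative helper)."""
    return [
        "fio",
        "--name=randread",
        f"--filename={target}",
        "--readonly",            # refuse to run if the job would write
        "--rw=randread",
        "--bs=4k",
        "--ioengine=libaio",     # assumption: async I/O via libaio
        "--direct=1",            # bypass the page cache
        f"--iodepth={iodepth}",  # number of reads kept in flight
        f"--runtime={runtime_s}",
        "--time_based",
        "--output-format=json",
    ]

if __name__ == "__main__":
    # Print one command per concurrency level; nothing is executed here.
    for depth in (1, 2, 4, 8, 16, 20, 32, 64):
        print(shlex.join(fio_randread_cmd("/dev/nvme0n1", depth)))
```

Sweeping the depth and plotting IOPS and latency from the JSON output gives the same kind of concurrency curve as the graphs here.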
The problem is I'm getting results that make no sense. Here's the first result:
$ sudo ./diskplorer.py --device=/dev/nvme0n1 --filesize=256G
This is fantastic! Samsung's own spec sheet claims 500K read IOPS and with 20 concurrent reads I'm getting almost 600K. The axis on the right is read latency in nanoseconds, the red line is mean latency, and the error bars are 5% and 95% latency. So it looks like the ideal concurrency level for this SSD is about 20 concurrent reads, yielding awesome latency < 100us.
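As a sanity check on those numbers, Little's Law ties the three quantities together: requests in flight = IOPS × mean latency. The sketch below applies it to the figures quoted above (20 concurrent reads, roughly 600K IOPS as read off the graph). The function name is illustrative.

```python
def mean_latency_us(concurrency, iops):
    """Mean latency in microseconds implied by Little's Law."""
    return concurrency / iops * 1e6

if __name__ == "__main__":
    # 20 reads in flight at ~600K IOPS
    print(round(mean_latency_us(20, 600_000), 1))  # 33.3
```

About 33 µs of mean latency is consistent with the sub-100 µs figure on the graph, so the IOPS and latency curves at least agree with each other.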
That's just the raw SSD. I'll put XFS on it, which is optimized for async I/O, and I'm sure it won't add any significant overhead...
With new XFS filesystem on /dev/nvme0n1
$ sudo mkfs.xfs /dev/nvme0n1
$ sudo mount /dev/nvme0n1 /mnt
$ sudo ./diskplorer.py --mountpoint=/mnt --filesize=256G
What!? That's awful! It seems XFS has introduced some absurd amount of latency and dramatically reduced IOPS. What could be wrong?
Just in case, reboot the system to clear out the caches, not that caching should be a factor on a brand new file system:
/dev/nvme0n1 after reboot
$ sudo shutdown -r now
(reboot happens)
$ sudo mount /dev/nvme0n1 /mnt
$ sudo ./diskplorer.py --mountpoint=/mnt --filesize=256G
No change. It's not cache related.
At this moment there is a valid XFS filesystem on /dev/nvme0n1, and it is mounted to /mnt. I'm going to repeat the test I did first, on the raw block device, unmounted, while leaving the contents of the XFS filesystem in place.
$ sudo umount /mnt
$ sudo ./diskplorer.py --device=/dev/nvme0n1 --filesize=256G
Oh no, XFS ruined my SSD performance!
Clearly, XFS has not diabolically ruined my SSD performance, nor is XFS poorly suited for this workload. But what could it be? Even with the disk unmounted, so XFS isn't involved at all, performance is still much reduced.
On a hunch, I tried DISCARDing the entire contents of the SSD, which should reset the allocation of cells within the disk to its original state...
$ sudo blkdiscard /dev/nvme0n1
$ sudo ./diskplorer.py --device=/dev/nvme0n1 --filesize=256G
Miraculously, the performance of my SSD is restored. Has the whole world gone mad?
Based on a suggestion from @shodanshok, what if I do a dd onto the SSD after I have "fixed" it by doing a blkdiscard? Here is the raw device after blkdiscard, then zeroed with dd:
$ sudo blkdiscard /dev/nvme0n1
$ sudo dd if=/dev/zero of=/dev/nvme0n1 bs=1M status=progress oflag=direct
$ sudo ./diskplorer.py --device=/dev/nvme0n1 --filesize=256G
This is an interesting result, and confirms my belief that XFS is not to blame here. Just by filling the SSD with zeroes, read latency and throughput have both significantly deteriorated. So it must be that the SSD itself has some optimized read path for unallocated sectors.
Clearly XFS isn't killing my SSD, and even if it were, blkdiscard wouldn't be magically restoring it. I emphasize again that these benchmarks are all read benchmarks, so issues with write journaling, write amplification, wear leveling, etc. are not applicable.
My theory is that this SSD and perhaps SSDs in general have an optimization in the read path, which detects a read of an unallocated region of the disk and executes a highly optimized code path that sends all zeros back over the PCIe bus.
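One way to probe this theory on a single device is to discard the whole SSD, write data into only the first part of it, and then compare read latency between the written region and the still-unallocated region. The following is a minimal sketch of such a probe, assuming Linux and Python 3.7+. It issues one read at a time (queue depth 1), so it measures per-read latency rather than peak IOPS. The function name and parameters are hypothetical.

```python
import mmap
import os
import random
import time

def time_random_reads(path, span_bytes, start=0, count=1000,
                      block=4096, direct=True):
    """Return mean latency in microseconds of `count` random reads of
    `block` bytes within [start, start + span_bytes) of `path`."""
    if span_bytes < block:
        raise ValueError("span_bytes must be at least one block")
    # O_DIRECT bypasses the page cache so the device itself is measured.
    flags = os.O_RDONLY | (os.O_DIRECT if direct else 0)
    fd = os.open(path, flags)
    # An anonymous mmap is page-aligned, which O_DIRECT requires.
    buf = mmap.mmap(-1, block)
    try:
        blocks = span_bytes // block
        t0 = time.perf_counter()
        for _ in range(count):
            offset = start + random.randrange(blocks) * block
            os.preadv(fd, [buf], offset)
        elapsed = time.perf_counter() - t0
        return elapsed / count * 1e6
    finally:
        buf.close()
        os.close(fd)
```

For example, after a blkdiscard followed by a dd that writes only the first 1 GiB, calling `time_random_reads("/dev/nvme0n1", 1 << 30, start=0)` and `time_random_reads("/dev/nvme0n1", 1 << 30, start=1 << 30)` as root would compare the written and unallocated regions directly. A clear gap between the two would support the theory, though it would not by itself show how the controller implements the fast path.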
My question is, does anyone know if that is correct? If so, are benchmarks of new SSDs without filesystems generally suspect, and is this documented anywhere? If this is not correct, does anyone have any other explanation for these bizarre results?