An Empirical Guide to the Behavior and Use of Scalable Persistent Memory
https://www.usenix.org/conference/fast20/presentation/yang
Presentation
Optane DIMMs
Not just slow dense DRAM
Slower media
More complex architecture
Second-order performance anomalies
Fundamentally different
Outline
Background
Basics: Optane DIMM performance
Lessons: Optane DIMM best practices
Conclusion
Background: Optane in the machine
iMC: integrated memory controller
Ways to use Optane:
Memory mode: uses Optane to expand main memory capacity without persistence.
Combines the Optane DIMM with a conventional DRAM DIMM on the same memory channel; the DRAM serves as a direct-mapped cache for the NVDIMM.
App Direct mode: provides persistence and does not use a DRAM cache.
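As a rough illustration of App Direct usage (not from the talk): on Linux an application can mmap a file on a DAX filesystem backed by Optane and get direct load/store access. The path and size below are hypothetical; MAP_SYNC needs a kernel and filesystem with DAX support.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical file on a DAX-mounted filesystem backed by an Optane namespace. */
    int fd = open("/mnt/pmem/data", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 1u << 20;   /* 1 MB region, assumed already allocated in the file */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    *(uint64_t *)p = 42;     /* ordinary store; made persistent with clwb + fence */
    munmap(p, len);
    close(fd);
    return 0;
}
```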
Optane controller: merges four 64 B cache lines into a 256 B block and issues that block to media.
Buffer: merges adjacent accesses before they reach the media.
AIT: address indirection table.
Resides in Optane media, but is also cached in on-device DRAM.
ADR: asynchronous DRAM refresh.
On power failure, there is enough stand-by power to flush the write pending queues to media.
The ADR domain does not include the processor caches, so stores are persistent only once they reach the WPQs.
WPQ: write pending queue
The iMC communicates with the Optane DIMMs using the DDR-T interface at cache-line (64-byte) granularity.
Interleaving: determined by how the data is physically laid out across the DIMMs.
First 4 KB: Optane DIMM 1; next 4 KB: Optane DIMM 4 (the next DIMM in the interleave set); and so on.
A 24 KB sequential access (6 × 4 KB) therefore touches all six Optane DIMMs (see the sketch below).
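A minimal sketch of that layout, assuming a simple round-robin 4 KB interleave across the six DIMMs (constants and DIMM ordering are illustrative, not taken from the actual platform configuration):

```c
#include <stdio.h>
#include <stdint.h>

#define INTERLEAVE_SIZE 4096   /* 4 KB interleave granularity (from the talk)   */
#define NUM_DIMMS       6      /* six Optane DIMMs on the socket in the testbed */

/* Which DIMM a byte offset within the interleaved region lands on,
 * assuming simple round-robin interleaving. */
static int dimm_for_offset(uint64_t offset) {
    return (int)((offset / INTERLEAVE_SIZE) % NUM_DIMMS);
}

int main(void) {
    /* A 24 KB sequential access (6 x 4 KB) touches every DIMM exactly once. */
    for (uint64_t off = 0; off < 6 * INTERLEAVE_SIZE; off += INTERLEAVE_SIZE)
        printf("offset %6llu -> DIMM %d\n",
               (unsigned long long)off, dimm_for_offset(off));
    return 0;
}
```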
Basics: our approach
Sequential accesses: no access amplification.
In contrast, random accesses see roughly 3× lower bandwidth.
Write
A store is acknowledged as soon as it hits the on-socket integrated memory controller,
so latency is not measured all the way to the media (measuring that from software is not really possible).
What is measured instead is writing to the cache plus flushing it out:
the write into the cache costs the same in either case; what differs is the flush (see the sketch below).
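A sketch of the write path being measured, assuming an x86 CPU with clwb support (compile with -mclwb); the store becomes persistent once it reaches the ADR domain, so what gets timed is the store plus the flush and fence, never the trip to the media:

```c
#include <immintrin.h>
#include <stdint.h>

/* Persist one 64-bit value: store it, write the dirty cache line back,
 * and fence so the write-back is ordered before later stores.  Once the
 * line reaches the iMC's write pending queue (inside the ADR domain) it
 * counts as persistent; software never observes it reaching the media. */
static inline void persist_u64(uint64_t *dst, uint64_t val) {
    *dst = val;          /* normal store into the cache hierarchy          */
    _mm_clwb(dst);       /* clean the cache line (cache-line write-back)   */
    _mm_sfence();        /* order the write-back before subsequent stores  */
}
```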
DRAM
More cores, performance goes up
NI: non-interleaved
Reads come from a single DIMM,
and bandwidth is low.
Speculation: the media is slow and there is contention, which adds extra delay.
Interleaved
Writes saturate the bandwidth quickly and contention grows worse; beyond about 3 cores they slow down.
Interleaving uses all six Optane DIMMs efficiently: reads scale well with thread count.
Write bandwidth maxes out and stays roughly constant with respect to thread count.
Using the block size efficiently
Read: peak bandwidth at 512 B accesses, but a valley at 4 KB (an odd contention effect).
Unfortunately, file systems commonly use this 4 KB size.
Lessons: what are Optane best practices?
Avoid small random accesses
Use ntstores for large writes
Limit threads accessing one NVDIMM
Avoid mixed and multi-threaded NUMA accesses
Lesson 1: avoid small random accesses
Small accesses (smaller than 256 B) lose bandwidth.
Varying the working-set size reveals how large the on-device buffer is: about 16 KB (see the sketch below).
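One hedged way to act on this lesson, assuming the 256 B internal block size and ~16 KB buffer mentioned above (the record layout and function are hypothetical): size and align updates to whole 256 B blocks and issue them in ascending order so the on-DIMM buffer can merge them.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define XPLINE 256   /* Optane internal access granularity (per the talk) */

/* Illustrative record padded to one 256 B internal block, so an update
 * rewrites exactly one block instead of straddling two and paying
 * read-modify-write amplification inside the DIMM. */
struct record {
    uint64_t key;
    uint64_t payload[24];
    uint8_t  pad[XPLINE - 25 * 8];
} __attribute__((aligned(XPLINE)));

/* Update records in ascending slot order (caller sorts 'slots') so
 * adjacent writes can be merged in the ~16 KB on-DIMM buffer, rather
 * than arriving as scattered small accesses. */
void update_batch(struct record *table, const uint64_t *slots,
                  const struct record *vals, size_t n) {
    for (size_t i = 0; i < n; i++)
        memcpy(&table[slots[i]], &vals[i], sizeof(struct record));
}
```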
Lesson 2: Use ntstores for large writes
ntstore (non-temporal store): bypasses the cache hierarchy and issues the write directly to the backing DIMM.
store + clwb: do a store and then evict or clean that cache line with a cache-line write-back (CLWB).
Con: lost bandwidth.
It preserves sequential access, which you want if at all possible,
but each store amounts to a read plus a write of the destination line, using double the bandwidth.
store (alone): data trickles out from the cache into the media.
Con: lost locality.
The cache evicts a line whenever it decides to and does not optimize for evicting lines in a sequential pattern (i.e., it introduces randomness), which leads to terrible device utilization (see the sketch below).
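A minimal sketch of the ntstore path, assuming x86-64 SSE2 intrinsics (the function name and alignment assumptions are mine, not from the talk): the streaming stores skip the cache, so there is no read of the destination line and no cache-decided eviction order.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Large sequential write using non-temporal stores: data bypasses the
 * cache hierarchy and heads straight toward the DIMM, preserving the
 * sequential pattern.  Assumes dst/src are 8-byte aligned and len_bytes
 * is a multiple of 8. */
void ntstore_copy(uint64_t *dst, const uint64_t *src, size_t len_bytes) {
    size_t n = len_bytes / sizeof(uint64_t);
    for (size_t i = 0; i < n; i++)
        _mm_stream_si64((long long *)&dst[i], (long long)src[i]);
    _mm_sfence();   /* order the streaming stores before what follows */
}
```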
Lesson #3: Limit threads accessing one NVDIMM
Two sources of contention: the Optane buffer (which merges adjacent accesses) and the iMC.
Contention at the Optane buffer:
Multiple threads thrash this on-device buffer.
Accesses that would otherwise be sequential and mergeable no longer are:
with all threads inserting their own writes, the opportunity to merge is lost.
Read
There is a spike at maximum bandwidth with a few threads;
adding more threads starts thrashing the buffer.
With multiple threads spread across multiple DIMMs, accesses can still clog a single DIMM, so bandwidth falls
and the pipeline ends up stalling.
This happens when the access size is exactly the interleave size: each thread's burst of accesses hits the memory controller and the same DIMM at once (see the sketch below).
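One way to apply this lesson, sketched under the assumption that a few writer threads each own one large contiguous region (thread count and helper names are hypothetical): each thread's stores stay sequential and mergeable in the on-DIMM buffer, and no single DIMM is hammered by many threads at once.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_WRITERS 4   /* keep the writer count small; the talk shows write
                           bandwidth peaking with only a few threads */

struct chunk { uint8_t *base; size_t len; };

/* Each thread fills one large contiguous chunk, so its stores stay
 * sequential instead of many threads interleaving small writes. */
static void *fill_chunk(void *arg) {
    struct chunk *c = (struct chunk *)arg;
    memset(c->base, 0xAB, c->len);
    return NULL;
}

void parallel_fill(uint8_t *pmem, size_t total) {
    pthread_t tids[NUM_WRITERS];
    struct chunk chunks[NUM_WRITERS];
    size_t per = total / NUM_WRITERS;

    for (int i = 0; i < NUM_WRITERS; i++) {
        chunks[i].base = pmem + (size_t)i * per;
        chunks[i].len  = per;
        pthread_create(&tids[i], NULL, fill_chunk, &chunks[i]);
    }
    for (int i = 0; i < NUM_WRITERS; i++)
        pthread_join(tids[i], NULL);
}
```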
Lesson #4: avoid mixed and multi-threaded NUMA accesses