An Empirical Guide to the Behavior and Use of Scalable Persistent Memory

https://www.usenix.org/conference/fast20/presentation/yang

Optane DIMMs
- Not just slow, dense DRAM
- Slower media
  - More complex architecture
    - Second-order performance anomalies
    - Fundamentally different

Outline
- Background
- Basics: Optane DIMM performance
- Lessons: Optane DIMM best practices
- Conclusion

iMC: integrated memory controller
Ways to use Optane
- Memory mode: uses Optane to expand main memory capacity, without persistence
  - Pairs each Optane DIMM with a conventional DRAM DIMM on the same memory channel; the DRAM serves as a direct-mapped cache for the NVDIMM
- App Direct mode: provides persistence and does not use a DRAM cache
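
A minimal sketch of what "direct-mapped" means for the DRAM cache in Memory mode. The 64-byte block size and the tiny cache size are assumptions chosen for illustration, and `cache_slot` and `conflicts` are hypothetical helpers, not part of any Intel API.

```python
CACHE_BLOCK = 64          # assumed cache block size in bytes
DRAM_CACHE_BYTES = 4096   # hypothetical, unrealistically small DRAM cache

NUM_SLOTS = DRAM_CACHE_BYTES // CACHE_BLOCK


def cache_slot(addr: int) -> int:
    """Return the single DRAM slot that an Optane address can occupy."""
    return (addr // CACHE_BLOCK) % NUM_SLOTS


def conflicts(addr_a: int, addr_b: int) -> bool:
    """Two different blocks conflict when they map to the same slot."""
    different_block = addr_a // CACHE_BLOCK != addr_b // CACHE_BLOCK
    return different_block and cache_slot(addr_a) == cache_slot(addr_b)
```

In a direct-mapped cache each address has exactly one possible slot, so two hot blocks whose addresses differ by a multiple of the cache size keep evicting each other.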

Optane controller: merges 4 cache lines into a 256B block and issues that block to the media
- Buffer: where the merging happens
AIT: address indirection table
- Resides in Optane media, but is also cached in on-DIMM DRAM
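
A rough sketch of the merging step, assuming ideal conditions (unlimited buffer space and no timing effects). `merge_into_blocks` is a hypothetical helper that only groups 64B cache-line addresses by the 256B block containing them.

```python
from collections import defaultdict

CACHE_LINE = 64    # bytes per request arriving from the iMC
MEDIA_BLOCK = 256  # bytes per media access


def merge_into_blocks(line_addrs):
    """Group 64 B cache-line addresses by the 256 B block that holds them."""
    blocks = defaultdict(set)
    for addr in line_addrs:
        block_addr = addr - addr % MEDIA_BLOCK
        blocks[block_addr].add(addr - addr % CACHE_LINE)
    return {block: sorted(lines) for block, lines in blocks.items()}


# Eight adjacent cache lines versus eight lines that are 256 B apart.
sequential = [i * CACHE_LINE for i in range(8)]
scattered = [i * MEDIA_BLOCK for i in range(8)]

sequential_blocks = merge_into_blocks(sequential)  # 2 blocks, 4 lines each
scattered_blocks = merge_into_blocks(scattered)    # 8 blocks, 1 line each
```

The same number of cache lines costs two media accesses when the lines are adjacent and eight when each line falls in a different block.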
ADR: asynchronous DRAM refresh
- On power failure, there is enough reserve (hold-up) power to flush the write pending queues to media
  - The ADR domain does not include the processor caches, so stores are only persistent once they reach the WPQs
- WPQ: write pending queue
- The iMC communicates with the Optane DIMM over the DDR-T interface at cache-line (64-byte) granularity
  - Has to do with the physical features of how the data is laid out
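
A toy model of the persistence boundary described above, not a description of the real hardware. `ToyPersistenceModel` and its methods are hypothetical; `flush` stands in for whatever writes a cache line back (for example CLWB followed by a fence).

```python
class ToyPersistenceModel:
    """Tracks where a stored value lives: CPU cache, WPQ, or media."""

    def __init__(self):
        self.cpu_cache = {}  # volatile, outside the ADR domain
        self.wpq = {}        # write pending queue, inside the ADR domain
        self.media = {}      # persistent Optane media

    def store(self, addr, value):
        """A plain store only updates the CPU cache."""
        self.cpu_cache[addr] = value

    def flush(self, addr):
        """Model a cache-line write-back that reaches the WPQ."""
        if addr in self.cpu_cache:
            self.wpq[addr] = self.cpu_cache[addr]

    def power_failure(self):
        """ADR drains the WPQ to media; cache contents are lost."""
        self.media.update(self.wpq)
        self.wpq.clear()
        self.cpu_cache.clear()


model = ToyPersistenceModel()
model.store(0, "flushed")
model.store(64, "not flushed")
model.flush(0)
model.power_failure()
survivors = dict(model.media)  # only the flushed store is present
```

Only the store that reached the WPQ before the failure survives; the one still sitting in the cache is gone.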

DRAM
- More cores, performance goes up
NI: non-interleaved
- Reads from a single DIMM
- Bandwidth is low
- Speculation: slow media plus contention, which adds extra delay
Interleaved
- Write: bandwidth saturates and contention grows worse
  - Slows down beyond about 3 cores
Using all Optane DIMMs, interleaved reads scale well and use the devices efficiently
Writes max out and then stay constant with respect to thread count
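
A sketch of how interleaving spreads addresses across DIMMs. The 4KB interleave granularity and the six-DIMM set are assumptions that are not stated in these notes, and `dimm_for` and `dimms_touched` are hypothetical helpers.

```python
INTERLEAVE_BYTES = 4096  # assumed interleave granularity
NUM_DIMMS = 6            # assumed number of DIMMs in the interleave set


def dimm_for(addr: int) -> int:
    """Return the index of the DIMM that serves this address."""
    return (addr // INTERLEAVE_BYTES) % NUM_DIMMS


def dimms_touched(start: int, length: int) -> set:
    """Return the set of DIMMs that one contiguous access touches."""
    first_chunk = start // INTERLEAVE_BYTES
    last_chunk = (start + length - 1) // INTERLEAVE_BYTES
    return {chunk % NUM_DIMMS for chunk in range(first_chunk, last_chunk + 1)}


aligned_4k = dimms_touched(0, 4096)        # one DIMM
full_stripe = dimms_touched(0, 6 * 4096)   # all six DIMMs
```

Under these assumptions an aligned 4KB access is served by a single DIMM, while an access that covers a full stripe engages every DIMM in the set.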

Read: peak bandwidth at 512B, but a valley at 4KB (weird contention)
- File systems use this 4KB size

Lesson 1: avoid small random accesses
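
A back-of-the-envelope sketch of why small random accesses hurt, assuming every isolated access is rounded out to whole 256B media blocks and nothing merges with its neighbors. `media_bytes` and `amplification` are hypothetical helpers.

```python
MEDIA_BLOCK = 256  # bytes per media access


def media_bytes(offset: int, size: int) -> int:
    """Bytes the media transfers for one access, rounded out to blocks."""
    first_block = offset // MEDIA_BLOCK
    last_block = (offset + size - 1) // MEDIA_BLOCK
    return (last_block - first_block + 1) * MEDIA_BLOCK


def amplification(offset: int, size: int) -> float:
    """Ratio of bytes moved at the media to bytes the program asked for."""
    return media_bytes(offset, size) / size


small_random = amplification(0, 64)     # 4.0: 256 B moved for 64 B requested
block_sized = amplification(0, 256)     # 1.0: nothing wasted
misaligned = amplification(192, 128)    # 4.0: 128 B straddles two blocks
```

Accesses that are at least 256B and block-aligned waste no media bandwidth in this model; smaller or misaligned ones pay for bytes they never use.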

Lesson 2: Use ntstores for large writes

ntstore (non-temporal store): bypasses the cache hierarchy and issues directly to the backing DIMM
store + clwb: does a store, then evicts or cleans that cache line with a cache line write back (CLWB)
- Con: lost bandwidth
- Preserve the sequential access pattern if at all possible
- Does a read plus a write, using double the bandwidth
store: data trickles out from the cache into media
- Con: lost locality
- The cache evicts a line whenever it decides to and does not optimize for evicting in a sequential pattern (i.e. it introduces randomness) --> terrible device utilization
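
A toy simulation of the "lost locality" point: the same cache lines reach the DIMM either in sequential order or in an arbitrary eviction order. The LRU buffer model and its 64-entry capacity are assumptions made for illustration, not a description of the real Optane buffer.

```python
import random
from collections import OrderedDict

CACHE_LINE = 64
MEDIA_BLOCK = 256
LINES_PER_BLOCK = MEDIA_BLOCK // CACHE_LINE
BUFFER_ENTRIES = 64  # hypothetical number of 256 B blocks the buffer holds


def media_block_writes(line_addrs, entries=BUFFER_ENTRIES):
    """Count 256 B media writes for a stream of distinct 64 B write-backs.

    The buffer is modeled as an LRU set of partially filled blocks. A block
    is written once when it fills up, when it is evicted, or at the end.
    """
    buffer = OrderedDict()  # block address -> set of line addresses seen
    writes = 0
    for addr in line_addrs:
        block = addr - addr % MEDIA_BLOCK
        lines = buffer.pop(block, set())
        lines.add(addr)
        if len(lines) == LINES_PER_BLOCK:
            writes += 1                 # full block: one efficient write
            continue
        buffer[block] = lines           # re-insert as most recently used
        if len(buffer) > entries:
            buffer.popitem(last=False)  # evict the least recently used block
            writes += 1                 # a partial block still costs a write
    return writes + len(buffer)         # drain whatever is left


sequential = [i * CACHE_LINE for i in range(4096)]  # 1024 blocks of data
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)                  # arbitrary eviction order

sequential_writes = media_block_writes(sequential)
shuffled_writes = media_block_writes(shuffled)
```

In this model the sequential stream needs one media write per block, while the shuffled stream needs several times as many because blocks are evicted from the buffer before their remaining lines arrive.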

Lesson 3: Limit threads accessing one NVDIMM

Contention at Optane Buffer
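
One way an application might act on this lesson is to cap the number of threads working on any one DIMM at a time. The sketch below uses a semaphore per DIMM; the cap of 2, the six-DIMM count, and the `with_dimm_slot` helper are all hypothetical, and the caller is assumed to already know which DIMM its data lives on.

```python
import threading

MAX_WRITERS_PER_DIMM = 2  # hypothetical cap; the right value needs measuring
NUM_DIMMS = 6             # assumed number of DIMMs

# One bounded semaphore per DIMM limits concurrent accessors.
_dimm_slots = [
    threading.BoundedSemaphore(MAX_WRITERS_PER_DIMM) for _ in range(NUM_DIMMS)
]


def with_dimm_slot(dimm, work):
    """Run work() while holding one of the limited slots for this DIMM."""
    with _dimm_slots[dimm]:
        return work()
```

A worker thread would wrap each burst of accesses, for example `with_dimm_slot(3, flush_my_batch)`, where `flush_my_batch` is a placeholder for the application's own function. Threads targeting different DIMMs do not block each other.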