An Empirical Guide to the Behavior and Use of Scalable Persistent Memory

https://www.usenix.org/conference/fast20/presentation/yang

Presentation

  • Optane DIMMs

    • Not just slow, dense DRAM

    • Slower media + a more complex architecture

      • Second-order performance anomalies

        • Fundamentally different from DRAM

  • Outline

    • Background

    • Basics: Optane DIMM performance

    • Lessons: Optane DIMM best practices

    • Conclusion

Background: Optane in the machine

  • iMC: integrated memory controller

  • Using Optane: two modes

    • Memory mode: use Optane to expand main memory capacity, without persistence

      • Pairs each Optane DIMM with a conventional DRAM DIMM on the same memory channel; the DRAM DIMM serves as a direct-mapped cache for the NVDIMM

    • App Direct mode: provides persistence and does not use a DRAM cache
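As a rough illustration of how App Direct mode is consumed from user space, the sketch below maps a file on a DAX filesystem and writes to it directly; the mount point /mnt/pmem/data is a placeholder, and this is not the paper's benchmark code.

```c
/* Sketch: map a file on a DAX filesystem (App Direct mode) and write to it.
 * The path /mnt/pmem/data is a placeholder for an fsdax-mounted namespace. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (1UL << 20)  /* 1 MiB, arbitrary example size */

int main(void) {
    int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0644);
    if (fd < 0) return 1;
    if (ftruncate(fd, REGION_SIZE) != 0) return 1;

    /* MAP_SHARED on a DAX file maps NVDIMM pages directly (no page cache). */
    char *pmem = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED) return 1;

    memset(pmem, 0x42, REGION_SIZE);  /* ordinary stores reach the DIMM */
    /* Persistence still requires flushing cache lines; see the ADR/WPQ
     * sketch later in these notes. */
    munmap(pmem, REGION_SIZE);
    close(fd);
    return 0;
}
```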

  • Optane controller: merges four 64 B cache lines into a 256 B block and issues that block to the media

    • An on-DIMM buffer does this merging of adjacent accesses

  • AIT: address indirection table

    • Resides in Optane media, but is also cached in the on-DIMM DRAM

  • ADR: asynchronous DRAM refresh

    • On power failure, there is enough stored energy to flush the write pending queues to media

      • The ADR domain does not include the processor caches, so stores are only persistent once they reach WPQs

    • WPQ: write pending queue

    • The iMC communicates with the Optane DIMM over the DDR-T interface at cache-line (64-byte) granularity

      • This reflects how the data is physically laid out on the DIMM
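Because the ADR domain stops at the WPQs, a store only becomes persistent after its cache line has been written back toward the iMC. A minimal sketch of that flush path, assuming a pointer obtained from a DAX mapping like the one above and a CPU with the CLWB instruction:

```c
/* Sketch: make a store persistent by writing back its cache line (CLWB)
 * and fencing, so the data reaches the ADR domain (the iMC's WPQ).
 * Compile with a CLWB-capable target, e.g. gcc -mclwb. */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* Write back [addr, addr+len) cache line by cache line, then fence. */
static void persist(const void *addr, size_t len) {
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHELINE)
        _mm_clwb((void *)p);       /* write back without necessarily evicting */
    _mm_sfence();                  /* order: write-backs complete before continuing */
}

void store_and_persist(uint64_t *slot, uint64_t value) {
    *slot = value;                 /* the store lands in the cache first */
    persist(slot, sizeof *slot);   /* now it is inside the ADR domain, hence durable */
}
```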

  • Interleaving: the physical address space is striped across the Optane DIMMs in 4 KB chunks

    • First 4 KB: Optane DIMM 1; next 4 KB: Optane DIMM 4; and so on

    • A large enough sequential access touches all six Optane DIMMs (see the sketch below)
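The striping above can be summarized with a small illustrative mapping. The 4 KB and six-DIMM constants come from the talk; the round-robin order is a simplification (the real ordering, e.g. DIMM 1 then DIMM 4, differs), and the function is not the actual iMC logic.

```c
/* Sketch: which Optane DIMM a physical offset lands on under 4 KB
 * interleaving across six DIMMs (illustrative only). */
#include <stdint.h>

#define INTERLEAVE_SIZE 4096u    /* 4 KB per DIMM before moving on */
#define NUM_OPTANE_DIMMS 6u      /* six Optane DIMMs per socket in the paper's setup */

static unsigned dimm_for_offset(uint64_t offset) {
    return (offset / INTERLEAVE_SIZE) % NUM_OPTANE_DIMMS;
}
/* A 24 KB (6 x 4 KB) sequential access therefore touches all six DIMMs. */
```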

Basics: our approach

  • Sequential accesses: no access amplification

    • In contrast, random accesses see roughly 3x lower bandwidth

  • Writes

    • A write is acknowledged as soon as it reaches the on-socket integrated memory controller

      • Latency is not measured all the way to media (it may not even be possible to measure that)

      • Measured write latency = writing to the cache + flushing it out (see the timing sketch below)

        • The write to the cache is the same in both cases; the difference is in the flush
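A hedged sketch of what "writing to the cache + flushing it out" looks like as a measurement: time a store plus the write-back that pushes it to the iMC, using the TSC. The fencing and counter choice are a common pattern, not necessarily the paper's exact harness.

```c
/* Sketch: measure write latency as (store + clwb + sfence), i.e. the time
 * until the write is accepted by the iMC, not until it reaches Optane media.
 * Compile with e.g. gcc -O2 -mclwb. */
#include <immintrin.h>
#include <stdint.h>
#include <x86intrin.h>

static uint64_t timed_persistent_store(volatile uint64_t *slot, uint64_t v) {
    _mm_mfence();                           /* drain earlier memory traffic */
    uint64_t start = __rdtsc();
    *slot = v;                              /* write hits the cache */
    _mm_clwb((void *)(uintptr_t)slot);      /* ...and is written back toward the iMC */
    _mm_sfence();                           /* wait until the write-back is accepted */
    uint64_t end = __rdtsc();
    return end - start;                     /* cycles; convert using the TSC frequency */
}
```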

  • DRAM

    • More cores, performance goes up

  • NI: non-interleaved

    • Reads all go to a single DIMM

    • Bandwidth is low

    • Speculation: the slow media plus contention add extra delay

  • Interleaved

    • Writes: bandwidth saturates quickly and contention grows worse

      • Beyond about 3 cores, performance slows down

  • Using all Optane DIMMs, interleaved reads scale well and use the DIMMs efficiently

  • Writes max out early and stay roughly constant with respect to thread count

  • Using the access (block) size efficiently

  • Reads: peak bandwidth at 512 B accesses, but a valley at 4 KB (an odd contention effect)

    • Unfortunate, since file systems commonly use this 4 KB size

Lessons: what are Optane best practices?

  • Avoid small random accesses

  • Use ntstores for large writes

  • Limit threads accessing one NVDIMM

  • Avoid mixed and multi-threaded NUMA accesses

Lesson 1: Avoid small random accesses

  • Small accesses (smaller than 256 B) lose bandwidth

  • Vary the working set size

    • This reveals how large the on-device buffer is: about 16 KB
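One way to act on this lesson, sketched under the assumption that the internal block size is 256 B as stated above: stage small updates in DRAM and write them out as full, aligned 256 B blocks rather than as scattered 64 B stores.

```c
/* Sketch: write to persistent memory in full 256 B-aligned blocks, matching
 * the Optane internal access granularity, instead of scattered small stores. */
#include <string.h>

#define XPLINE 256  /* Optane internal block size quoted in the talk */

/* Copy one 256 B block from a DRAM staging buffer to its pmem home.
 * `pmem_dst` is assumed to be 256 B aligned (e.g. carved out of a DAX
 * mapping); scattered 64 B writes would waste 3/4 of the media traffic. */
static void flush_block(char *pmem_dst, const char *staged) {
    memcpy(pmem_dst, staged, XPLINE);
    /* Follow with clwb + sfence (or use ntstores) to make it persistent. */
}
```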

Lesson 2: Use ntstores for large writes

  • ntstore (non-temporal store): bypasses the cache hierarchy and issues the write directly to the backing DIMM

  • store + clwb: do a store and then clean (write back) or evict that cache line using a cache-line write-back (CLWB)

    • Con: Lost bandwidth

    • Does preserve the sequential access pattern where possible

    • But the store first pulls the line into the cache, so it does a read plus a write, using double the bandwidth

  • store alone: data trickles out from the cache into the media as lines are evicted

    • Con: Lost locality

    • The cache evicts lines when it decides to; it does not try to evict them in a sequential pattern (i.e., it introduces randomness), which leads to terrible device utilization
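The two write paths contrasted above, as an illustrative sketch using standard x86 intrinsics (the loop bodies are examples, not the paper's benchmark code):

```c
/* Sketch: two ways to write a large, sequential buffer to persistent memory.
 * Compile with e.g. gcc -O2 -mclwb; ntstores need only x86-64 SSE2. */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* (a) ntstore path: bypass the cache hierarchy, keep the access sequential. */
static void write_ntstore(uint64_t *dst, const uint64_t *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        _mm_stream_si64((long long *)&dst[i], (long long)src[i]);
    _mm_sfence();                    /* drain the write-combining buffers */
}

/* (b) store + clwb path: data is pulled into the cache first (extra read),
 * then explicitly written back line by line.
 * Assumes dst is 64 B aligned and n is a multiple of 8 (8 x 8 B = 64 B). */
static void write_store_clwb(uint64_t *dst, const uint64_t *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
        if ((i + 1) % 8 == 0)        /* finished one full cache line */
            _mm_clwb(&dst[i + 1 - 8]);
    }
    _mm_sfence();
}
```

The ntstore path avoids filling the cache line before overwriting it, which is why it tends to win for large sequential writes, matching the lesson above.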

Lesson 3: Limit threads accessing one NVDIMM

  • Contention at the Optane buffer (the buffer that merges adjacent accesses)

  • Contention at iMC

Contention at Optane Buffer

  • Multiple threads thrash this on-device buffer

    • Accesses that would otherwise be sequential and mergeable no longer are

    • Every thread inserts its own writes, so the opportunity to merge is lost

    • Reads

      • A spike up to max bandwidth with a few threads

      • Adding more threads thrashes the buffer

  • Multiple threads spread across multiple DIMMs can still clog a single DIMM, and bandwidth falls

  • This ends up causing stalls

  • When the access size exactly matches the interleave size, accesses arrive in bursts that hit the same memory controller and the same DIMM
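A hedged way to apply this lesson: cap the number of writer threads and hand each one a large, contiguous, stripe-aligned slice of the mapping, so only a few threads target any one DIMM's buffer at a time. The partitioning below is illustrative; the 4 KB x 6 DIMM constants come from the talk.

```c
/* Sketch: partition a pmem region so each thread writes one large contiguous
 * slice, keeping the number of threads that hit any single DIMM low. */
#include <stddef.h>

#define INTERLEAVE_SIZE 4096u
#define NUM_OPTANE_DIMMS 6u
#define STRIPE (INTERLEAVE_SIZE * NUM_OPTANE_DIMMS)   /* 24 KB spans all DIMMs */

struct slice { char *base; size_t len; };

/* Divide [base, base+len) among nthreads in stripe-aligned chunks. */
static struct slice slice_for_thread(char *base, size_t len,
                                     unsigned tid, unsigned nthreads) {
    size_t per = (len / nthreads) / STRIPE * STRIPE;  /* round down to a stripe */
    struct slice s = { base + (size_t)tid * per, per };
    return s;
}
```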

Lesson 4: Avoid mixed and multi-threaded NUMA accesses
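The talk states this lesson without further detail; a common way to follow it is to keep each worker thread on the same socket as the NVDIMMs it touches, so no Optane traffic crosses the socket interconnect. A minimal libnuma sketch (the node number and any pmem paths are assumptions):

```c
/* Sketch: keep Optane accesses NUMA-local by running the worker thread on the
 * same socket as the NVDIMMs it uses. Link with -lnuma. */
#include <numa.h>
#include <stdio.h>

/* `node` is the socket whose Optane DIMMs back the mapping this thread uses
 * (e.g. the namespace behind a node-local DAX mount -- a placeholder). */
static int bind_worker_to_socket(int node) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return -1;
    }
    /* Restrict this thread to CPUs of `node` so its loads/stores to the
     * local NVDIMMs never cross to the remote socket. */
    return numa_run_on_node(node);
}
```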
