GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability

Introduction - Cray XK7 Titan

  • Swapping: GPUs were swapped because of a failing resistor

Data Collection & Preprocessing

  • Off the bus (OTB): the CPU loses its connection to the GPU

  • Double bit error (DBE): an error detected in GPU memory

Reboot: software update or hardware swap

  • Aggregate the data associated with each GPU into one row per GPU, showing its location and the period of time it spent there
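The aggregation step can be sketched roughly as follows. This is only an illustration: the record layout (serial, location, timestamp, event type), the event names, and the sample values are all assumptions, not the actual Titan log format, and the sketch assumes each GPU stays in a single location.

```python
# Hypothetical event log: (gpu_serial, location, timestamp, event_type).
# Field names and values are illustrative, not taken from the Titan data.
events = [
    ("gpu-A", "loc-1", 10, "install"),
    ("gpu-A", "loc-1", 40, "OTB"),
    ("gpu-A", "loc-1", 90, "seen"),
    ("gpu-B", "loc-2", 5, "install"),
    ("gpu-B", "loc-2", 60, "DBE"),
    ("gpu-B", "loc-2", 75, "seen"),
]


def aggregate_per_gpu(events):
    """Collapse raw events into one summary row per GPU."""
    rows = {}
    # Process events in time order so first_seen is the earliest timestamp.
    for serial, location, ts, kind in sorted(events, key=lambda e: e[2]):
        row = rows.setdefault(serial, {
            "gpu": serial,
            "location": location,  # assumes a single location per GPU
            "first_seen": ts,
            "last_seen": ts,
            "otb": 0,
            "dbe": 0,
        })
        row["last_seen"] = max(row["last_seen"], ts)
        if kind == "OTB":
            row["otb"] += 1
        elif kind == "DBE":
            row["dbe"] += 1
    return list(rows.values())


rows = aggregate_per_gpu(events)
```

Each resulting row carries the GPU's location, the period it was observed (first_seen to last_seen), and its OTB and DBE counts. A GPU that moves between locations would need one row per (GPU, location) pair instead.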

Partial reboot or entire system? Whole-system reboot

Data Cleansing

Intuition for GPU Lifetimes

  • Install time: black dot

  • Blue square: OTB event

  • Red triangle: DBE event

  • Square: last seen event of the GPU
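Install time and last-seen time together give each GPU a lifetime that is either ended by a failure or still open when the GPU was last seen. One common way to turn such data into a survival curve is the Kaplan-Meier estimator. The notes do not say which estimator was actually used, so the sketch below is an assumption, and the sample durations are made up.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct failure time.

    durations: time from install to failure, or to last seen.
    observed:  True if the GPU failed, False if it was still working
               when last seen (right-censored).
    """
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        removed = 0
        # Group all GPUs that leave the risk set at the same time t.
        while i < len(data) and data[i][0] == t:
            failures += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical lifetimes in months (illustrative values only).
durations = [12, 30, 30, 45, 60]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
```

Censored GPUs still count as at risk up to the time they were last seen, which is why they reduce the risk set without lowering the survival estimate.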

2017: failing resistor

  • Had to do hardware/GPU swap

  • Installation

New batch: after 2017

Old batch: more OTB events

Small gaps in the time spent at a location, but overall the GPUs remain operational

  • 4 GPUs per blade

    • Take the blade out and replace the GPU on the blade

    • Put that blade back in a different location

  • 1 GPU affected

    • Entire blade is removed

  • Distribution similar

  • Number of DBE > OTB

  • Cage: where the GPU is located

Relative values compared to the other cages

  • Cage 2 higher than Cage 0 and 1 (hazard ratio)

  • Column level

    • Columns were next to service system

  • Interleaving level from column 0 to 11

  • Torus: the network interconnect of the supercomputer

  • A lower column number with a lower x coordinate has a higher hazard ratio than a lower column number with a higher x coordinate

    • Job scheduling impacts the overall performance

    • Workload is not balanced

      • GPUs with lower torus coordinates: more workload

      • Why?

        • Cluster is not full (70% of the cluster used)

        • Which machine?

          • Machines at the beginning are assigned with higher probability

      • Awareness?

  • Good outcome: a hazard ratio consistently around 1 is the ideal scenario
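The cage comparison above is expressed as a hazard ratio relative to the other cages. As a rough illustration of what such a ratio means, the sketch below assumes a constant hazard per group, so the hazard is estimated as failures divided by total time in service. This is a simplification and not necessarily the model behind the reported ratios. The function names and the sample records are hypothetical.

```python
def hazard_rate(records):
    """Failures per unit of exposure time (constant-hazard estimate)."""
    failures = sum(1 for _, failed in records if failed)
    exposure = sum(duration for duration, _ in records)
    return failures / exposure


def hazard_ratio_vs_rest(by_group, group):
    """Hazard rate of one group divided by the pooled rate of all others."""
    rest = [r for g, recs in by_group.items() if g != group for r in recs]
    return hazard_rate(by_group[group]) / hazard_rate(rest)


# Hypothetical (months_in_service, failed) records per cage.
by_cage = {
    0: [(50, True), (50, False), (100, False)],  # 1 failure in 200 months
    1: [(40, True), (60, False), (100, False)],  # 1 failure in 200 months
    2: [(30, True), (30, True), (40, False)],    # 2 failures in 100 months
}

ratio_cage2 = hazard_ratio_vs_rest(by_cage, 2)
ratio_cage0 = hazard_ratio_vs_rest(by_cage, 0)
```

With these made-up numbers, cage 2 comes out well above 1 and cage 0 below 1. A ratio near 1 for every group would mean that location has no noticeable effect on failure risk, which matches the ideal scenario described in the notes.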
