GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability
https://dl.acm.org/doi/abs/10.5555/3433701.3433755
Introduction - Cray XK7 Titan

Swapping: GPU swaps were driven by a failing resistor
Data Collection & Preprocessing

Off the bus (OTB): the CPU loses its connection to the GPU
Double-bit error (DBE): an uncorrectable two-bit error detected in GPU memory
Reboot: triggered by a software update or a hardware swap
Aggregation: collapse all events associated with a GPU into one row per GPU, recording its location and the period of time it spent there (see the sketch below)
Q: Does a reboot affect a portion of the machine or the entire system? A: The whole system is rebooted
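
A minimal sketch of this per-GPU aggregation in pandas. The column names (gpu_sn, location, time, event) and the sample rows are illustrative assumptions, not the paper's actual schema.

```python
import pandas as pd

# Hypothetical raw event log, one row per event. Column names and values
# are illustrative assumptions, not the paper's actual schema.
events = pd.DataFrame({
    "gpu_sn":   ["A1", "A1", "B2", "B2", "B2"],
    "location": ["c0-0c0s0", "c0-0c0s0", "c1-0c2s3", "c1-0c2s3", "c1-0c2s3"],
    "time":     pd.to_datetime(["2015-01-01", "2016-03-10",
                                "2015-01-01", "2017-06-01", "2018-02-15"]),
    "event":    ["install", "dbe", "install", "otb", "last_seen"],
})

# Collapse to one row per (GPU, location): the period of time it was there,
# plus counts of each failure type observed during that period.
per_gpu = (
    events.groupby(["gpu_sn", "location"])
          .agg(first_seen=("time", "min"),
               last_seen=("time", "max"),
               n_otb=("event", lambda s: (s == "otb").sum()),
               n_dbe=("event", lambda s: (s == "dbe").sum()))
          .reset_index()
)
per_gpu["days_in_service"] = (per_gpu["last_seen"] - per_gpu["first_seen"]).dt.days
print(per_gpu)
```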
Data Cleansing

Intuition for GPU Lifetimes

Install time: black dot
Blue square: OTB event
Red triangle: DBE event
Square: last seen event of the GPU
2017: a failing resistor forced a hardware/GPU swap
Installation batches (survival curves per batch are sketched below):
New batch: installed after 2017
Old batch: shows more OTB events
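
Comparing the old and new batches is the classic use case for Kaplan-Meier survival curves, the paper's core technique. A minimal sketch with the lifelines library; all durations and censoring flags below are made-up numbers, not data from the paper.

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Made-up per-GPU lifetimes in days. observed = 1 means the GPU failed;
# 0 means it was censored (still alive at the end of the observation window).
old_durations, old_observed = [400, 650, 700, 120, 980, 300], [1, 1, 0, 1, 0, 1]
new_durations, new_observed = [500, 800, 900, 760, 1000], [0, 0, 1, 0, 0]

kmf = KaplanMeierFitter()
ax = None
for label, dur, obs in [("old batch", old_durations, old_observed),
                        ("new batch", new_durations, new_observed)]:
    kmf.fit(dur, event_observed=obs, label=label)
    ax = kmf.plot_survival_function(ax=ax)  # one step curve S(t) per batch

ax.set_xlabel("days in service")
ax.set_ylabel("estimated survival probability")
plt.show()
```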

Small gaps appear in a GPU's location timeline, but overall the GPUs stay operational
4 GPUs per blade
To replace a GPU, the blade is taken out, the GPU on it is replaced, and the blade may be put back in a different location
Even if only 1 GPU is affected, the entire blade is removed

The event distributions look similar
Number of DBE events > number of OTB events




Cage: the chassis position of the GPU within a cabinet

Hazard ratios are relative values compared to the other cages
Cage 2 has a higher hazard ratio than Cages 0 and 1 (see the Cox sketch below)
Column level:
Some columns were next to the service system
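
Per-cage hazard ratios like these come out of a Cox proportional-hazards fit. A sketch with lifelines on a tiny made-up table; cage 0 is the reference level, so each fitted ratio is relative to it. Real fits would use the full per-GPU table.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Tiny illustrative per-GPU table (all values made up). duration is days in
# service; event = 1 if the GPU failed, 0 if censored; cage is the chassis
# index within a cabinet.
df = pd.DataFrame({
    "duration": [400, 650, 700, 120, 980, 300, 500, 800, 900, 760],
    "event":    [1,   1,   0,   1,   0,   1,   0,   1,   1,   0],
    "cage":     [0,   0,   0,   1,   1,   1,   2,   2,   2,   2],
})

# One-hot encode cage with cage 0 dropped, so each coefficient is a log
# hazard ratio relative to cage 0.
X = pd.get_dummies(df["cage"], prefix="cage", drop_first=True).astype(float)
X[["duration", "event"]] = df[["duration", "event"]]

cph = CoxPHFitter()
cph.fit(X, duration_col="duration", event_col="event")
print(cph.hazard_ratios_)  # cage_2 > 1 would mean higher risk than cage 0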

The hazard ratio interleaves (alternates) across columns 0 to 11

Torus: the network interconnect of the supercomputer (Titan used a 3D torus)
Columns at lower torus x-coordinates have a higher hazard ratio than columns at higher x-coordinates
Job scheduling impacts the overall picture
The workload is not balanced
GPUs at lower torus coordinates receive more of the workload
Q: Why? A: The cluster is not full (about 70% of the cluster is used)
Q: Which machines get the jobs? A: Machines at the beginning (lower coordinates) are assigned with higher probability (a toy scheduler sketch follows)
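
A toy first-fit scheduler makes the imbalance concrete: if each job grabs the lowest-indexed free nodes and the machine runs at about 70% utilization, low-coordinate nodes are almost always busy while high-coordinate nodes sit mostly idle. Everything here is an illustrative assumption, not the actual Titan scheduler.

```python
import random

N_NODES, STEPS, TARGET_BUSY = 100, 10_000, 70  # hold ~70% utilization
busy_until = [0] * N_NODES  # timestep at which each node becomes free
busy_time = [0] * N_NODES   # accumulated busy timesteps per node

for t in range(STEPS):
    free = [i for i in range(N_NODES) if busy_until[i] <= t]
    want = TARGET_BUSY - (N_NODES - len(free))  # nodes needed to stay at ~70%
    if want > 0:
        for i in free[:want]:                   # first fit: lowest indices first
            busy_until[i] = t + random.randint(5, 50)  # job length in steps
    for i in range(N_NODES):
        busy_time[i] += busy_until[i] > t

# Low-coordinate nodes end up near 100% busy; high-coordinate ones near 0%.
print("utilization of node  0:", busy_time[0] / STEPS)
print("utilization of node 99:", busy_time[99] / STEPS)
```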
Q: Is there awareness of this imbalance? A: The ideal scenario would be a hazard ratio that stays consistently around 1 across locations
