GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability
https://dl.acm.org/doi/abs/10.5555/3433701.3433755
Swapping: GPU replaced, here due to a failing resistor
Off the bus (OTB): the CPU loses its connection to the GPU
Double-bit error (DBE): an uncorrectable error detected in GPU memory
Reboot: caused by a software update or a hardware swap
Aggregate the data associated with each GPU into one row per GPU, recording its location and period of service
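A minimal sketch of this "one row per GPU" aggregation. The event-log schema (serial, event kind, date, location string) and the sample records are assumptions for illustration, not taken from the paper:

```python
from datetime import date

# Hypothetical raw event log: one record per (serial, event kind, date, location).
events = [
    ("SN001", "install",   date(2014, 1, 10), "c0-0c0s0"),
    ("SN001", "OTB",       date(2016, 5, 2),  "c0-0c0s0"),
    ("SN001", "last_seen", date(2017, 3, 1),  "c0-0c0s0"),
    ("SN002", "install",   date(2014, 1, 10), "c0-0c1s0"),
    ("SN002", "last_seen", date(2019, 8, 1),  "c0-0c1s0"),
]

def aggregate_per_gpu(events):
    """Collapse the event log into one row per GPU: location, install and
    last-seen dates, and per-GPU counts of OTB and DBE events."""
    rows = {}
    for serial, kind, when, loc in events:
        row = rows.setdefault(serial, {
            "serial": serial, "location": loc,
            "install": None, "last_seen": None, "OTB": 0, "DBE": 0,
        })
        if kind in ("install", "last_seen"):
            row[kind] = when          # record the time of the lifecycle event
        elif kind in ("OTB", "DBE"):
            row[kind] += 1            # count failure events per GPU
    return list(rows.values())
```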
Partial or whole-system reboot? These are whole-system reboots
Black dot: install time
Blue square: OTB event
Red triangle: DBE event
Square: last-seen event of the GPU
2017: failing resistor; required a hardware/GPU swap
Installation:
New batch: GPUs installed after 2017
Old batch: shows more OTB events
Small gaps occur at each location, but overall the system remains operational
4 GPUs per blade
To replace a GPU, the blade is taken out and the GPU is replaced on the blade
The blade may be put back in a different location
Only 1 GPU is affected, but the entire blade is removed
The distributions are similar
The number of DBE events exceeds the number of OTB events
Cage: the enclosure the GPU sits in
Hazard ratios are relative values, compared across cages
Cage 2 has a higher hazard ratio than cages 0 and 1 (hazard ratio)
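The cage comparison rests on hazard ratios. The paper's survival analysis is more sophisticated, but the simplest version is a ratio of failure rates (failures per unit of exposure time); the counts below are made up for illustration:

```python
def hazard_ratio(failures_a, exposure_a, failures_b, exposure_b):
    """Crude hazard ratio: group A's failure rate relative to group B's,
    where each rate = failures / total exposure time (e.g. GPU-years)."""
    return (failures_a / exposure_a) / (failures_b / exposure_b)

# Hypothetical counts, not from the paper: cage 2 vs. cage 0.
hr = hazard_ratio(30, 1000.0, 20, 1000.0)
# hr is about 1.5: cage 2 GPUs fail roughly 1.5x as often per GPU-year
```

A ratio above 1 means the first group fails faster; equal rates give exactly 1, which ties into the "consistently around 1" ideal mentioned at the end of these notes.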
Column level:
Some columns were next to the service system
Interleaving pattern across columns 0 to 11
Torus: the network interconnect of the supercomputer
Among lower column numbers, those with a lower torus x coordinate have a higher hazard ratio than those with a higher x coordinate
Job scheduling impacts the overall performance
The workload is not balanced
GPUs at lower torus coordinates receive more work
Why? The cluster is not full (about 70% of the cluster is used)
Which machines get the jobs? Nodes considered at the beginning of the allocation order are assigned jobs with higher probability
Awareness?
Good outcome: a hazard ratio consistently around 1 is the ideal scenario
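Since the paper's analysis is survival analysis over GPU lifetimes, a standard Kaplan-Meier estimator is the natural building block. A self-contained sketch (the input data shape is an assumption; the paper does not publish this code):

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the survival function S(t).
    durations: lifetime of each GPU (any time unit);
    observed: True if the GPU failed, False if censored (still running
    when last seen). Returns (time, S(t)) pairs at each failure time."""
    fail_times = sorted({d for d, obs in zip(durations, observed) if obs})
    out, s = [], 1.0
    for t in fail_times:
        # GPUs still "at risk" at time t: not yet failed or censored before t
        at_risk = sum(1 for d in durations if d >= t)
        # Failures occurring exactly at time t
        deaths = sum(1 for d, obs in zip(durations, observed) if obs and d == t)
        s *= 1.0 - deaths / at_risk
        out.append((t, s))
    return out

# Four hypothetical GPUs: two failed (at t=1 and t=2), two censored.
curve = kaplan_meier([1, 2, 2, 4], [True, True, False, False])
```

Censored GPUs (removed from service, or alive at the end of observation) still contribute exposure time before they drop out, which is exactly why survival methods fit this dataset better than raw failure counts.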