GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability
https://dl.acm.org/doi/abs/10.5555/3433701.3433755
Introduction - Cray XK7 Titan

Swapping: GPU swaps were driven by a failing resistor
Data Collection & Preprocessing

Off the bus (OTB): the CPU loses its connection to the GPU
Double-bit error (DBE): an uncorrectable two-bit error detected in GPU memory
Reboot: triggered by a software update or a hardware swap
Aggregation: collapse all events associated with a GPU into one row per GPU, recording its location and the period of time it spent there (see the sketch below)
Q: Does a reboot affect a portion of the machine or the entire system? A: The whole system is rebooted
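
A minimal sketch of this per-GPU aggregation in pandas. The column names (gpu_sn, location, time, event) and the sample rows are illustrative assumptions, not the paper's actual schema.

```python
import pandas as pd

# Hypothetical raw event log, one row per event. Column names and values
# are illustrative assumptions, not the paper's actual schema.
events = pd.DataFrame({
    "gpu_sn":   ["A1", "A1", "B2", "B2", "B2"],
    "location": ["c0-0c0s0", "c0-0c0s0", "c1-0c2s3", "c1-0c2s3", "c1-0c2s3"],
    "time":     pd.to_datetime(["2015-01-01", "2016-03-10",
                                "2015-01-01", "2017-06-01", "2018-02-15"]),
    "event":    ["install", "dbe", "install", "otb", "last_seen"],
})

# Collapse to one row per (GPU, location): the period of time it was there,
# plus counts of each failure type observed during that period.
per_gpu = (
    events.groupby(["gpu_sn", "location"])
          .agg(first_seen=("time", "min"),
               last_seen=("time", "max"),
               n_otb=("event", lambda s: (s == "otb").sum()),
               n_dbe=("event", lambda s: (s == "dbe").sum()))
          .reset_index()
)
per_gpu["days_in_service"] = (per_gpu["last_seen"] - per_gpu["first_seen"]).dt.days
print(per_gpu)
```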
Data Cleansing

Intuition for GPU Lifetimes

Install time: black dot
Blue square: OTB event
Red triangle: DBE event
Square: last seen event of the GPU
2017: a failing resistor forced a hardware/GPU swap
Installation batches (survival curves per batch are sketched below):
New batch: installed after 2017
Old batch: shows more OTB events
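
Comparing the old and new batches is the classic use case for Kaplan-Meier survival curves, the paper's core technique. A minimal sketch with the lifelines library; all durations and censoring flags below are made-up numbers, not data from the paper.

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Made-up per-GPU lifetimes in days. observed = 1 means the GPU failed;
# 0 means it was censored (still alive at the end of the observation window).
old_durations, old_observed = [400, 650, 700, 120, 980, 300], [1, 1, 0, 1, 0, 1]
new_durations, new_observed = [500, 800, 900, 760, 1000], [0, 0, 1, 0, 0]

kmf = KaplanMeierFitter()
ax = None
for label, dur, obs in [("old batch", old_durations, old_observed),
                        ("new batch", new_durations, new_observed)]:
    kmf.fit(dur, event_observed=obs, label=label)
    ax = kmf.plot_survival_function(ax=ax)  # one step curve S(t) per batch

ax.set_xlabel("days in service")
ax.set_ylabel("estimated survival probability")
plt.show()
```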

Small gaps appear in a GPU's location timeline, but overall the GPUs stay operational
4 GPUs per blade
To replace a GPU, the blade is taken out, the GPU on it is replaced, and the blade may be put back in a different location
Even if only 1 GPU is affected, the entire blade is removed

The event distributions look similar
Number of DBE events > number of OTB events




Cage: the chassis position of the GPU within a cabinet

Hazard ratios are relative values compared to the other cages
Cage 2 has a higher hazard ratio than Cages 0 and 1 (see the Cox sketch below)
Column level:
Some columns were next to the service system
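
Per-cage hazard ratios like these come out of a Cox proportional-hazards fit. A sketch with lifelines on a tiny made-up table; cage 0 is the reference level, so each fitted ratio is relative to it. Real fits would use the full per-GPU table.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Tiny illustrative per-GPU table (all values made up). duration is days in
# service; event = 1 if the GPU failed, 0 if censored; cage is the chassis
# index within a cabinet.
df = pd.DataFrame({
    "duration": [400, 650, 700, 120, 980, 300, 500, 800, 900, 760],
    "event":    [1,   1,   0,   1,   0,   1,   0,   1,   1,   0],
    "cage":     [0,   0,   0,   1,   1,   1,   2,   2,   2,   2],
})

# One-hot encode cage with cage 0 dropped, so each coefficient is a log
# hazard ratio relative to cage 0.
X = pd.get_dummies(df["cage"], prefix="cage", drop_first=True).astype(float)
X[["duration", "event"]] = df[["duration", "event"]]

cph = CoxPHFitter()
cph.fit(X, duration_col="duration", event_col="event")
print(cph.hazard_ratios_)  # cage_2 > 1 would mean higher risk than cage 0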

The hazard ratio interleaves (alternates) across columns 0 to 11

Torus: the network interconnect of the supercomputer (Titan used a 3D torus)
Columns at lower torus x-coordinates have a higher hazard ratio than columns at higher x-coordinates
Job scheduling impacts the overall picture
The workload is not balanced
GPUs at lower torus coordinates receive more of the workload
Q: Why? A: The cluster is not full (about 70% of the cluster is used)
Q: Which machines get the jobs? A: Machines at the beginning (lower coordinates) are assigned with higher probability (a toy scheduler sketch follows)
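
A toy first-fit scheduler makes the imbalance concrete: if each job grabs the lowest-indexed free nodes and the machine runs at about 70% utilization, low-coordinate nodes are almost always busy while high-coordinate nodes sit mostly idle. Everything here is an illustrative assumption, not the actual Titan scheduler.

```python
import random

N_NODES, STEPS, TARGET_BUSY = 100, 10_000, 70  # hold ~70% utilization
busy_until = [0] * N_NODES  # timestep at which each node becomes free
busy_time = [0] * N_NODES   # accumulated busy timesteps per node

for t in range(STEPS):
    free = [i for i in range(N_NODES) if busy_until[i] <= t]
    want = TARGET_BUSY - (N_NODES - len(free))  # nodes needed to stay at ~70%
    if want > 0:
        for i in free[:want]:                   # first fit: lowest indices first
            busy_until[i] = t + random.randint(5, 50)  # job length in steps
    for i in range(N_NODES):
        busy_time[i] += busy_until[i] > t

# Low-coordinate nodes end up near 100% busy; high-coordinate ones near 0%.
print("utilization of node  0:", busy_time[0] / STEPS)
print("utilization of node 99:", busy_time[99] / STEPS)
```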
Q: Is there awareness of this imbalance? A: The ideal scenario would be a hazard ratio that stays consistently around 1 across locations
