# GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability

### Introduction - Cray XK7 Titan

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYhLLQX4Rq7MUeNCVLY%2F-MYkGDEGeTxaGcxeEG1N%2Fimage.png?alt=media\&token=f3a26c3a-7eac-4b24-8caa-5c7c7cf43f1e)

* Swapping: GPUs were swapped because of a failing resistor

### Data Collection & Preprocessing

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYhLLQX4Rq7MUeNCVLY%2F-MYkHDYSRZNX7afX5og1%2Fimage.png?alt=media\&token=78886b5a-749a-4e12-b548-ef08e226dccb)

* Off the bus (OTB): the CPU loses its connection to the GPU
* Double bit error (DBE): a memory error detected in GPU memory

Reboot: caused by a software update or a hardware swap

* Aggregate the data associated with each GPU into one row per GPU, showing its locations and the periods of time spent there

Does a reboot cover a portion of the system or the entire system? It is a whole-system reboot.
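The aggregation step can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(serial, location, day, kind)`, the serial numbers, and the slot names are all hypothetical.

```python
from collections import Counter

def one_row_per_gpu(events):
    """Collapse a per-event log into one summary row per GPU serial number.

    `events` is an iterable of (serial, location, day, kind) tuples.
    This layout is an assumption made for illustration only.
    """
    rows = {}
    for serial, location, day, kind in events:
        row = rows.setdefault(serial, {
            "first_seen": day,
            "last_seen": day,
            "locations": [],
            "events": Counter(),
        })
        row["first_seen"] = min(row["first_seen"], day)
        row["last_seen"] = max(row["last_seen"], day)
        if location not in row["locations"]:
            row["locations"].append(location)  # keep order of first appearance
        if kind in ("OTB", "DBE"):
            row["events"][kind] += 1
    return rows

# Hypothetical log entries: a GPU that moves slots and logs one DBE,
# and a second GPU that logs one OTB.
log = [
    ("gpu-1", "slot-A", 0, "seen"),
    ("gpu-1", "slot-A", 40, "DBE"),
    ("gpu-1", "slot-B", 90, "seen"),
    ("gpu-2", "slot-C", 10, "seen"),
    ("gpu-2", "slot-C", 25, "OTB"),
]
table = one_row_per_gpu(log)
for serial, row in table.items():
    print(serial, row["first_seen"], row["last_seen"],
          row["locations"], dict(row["events"]))
```

Keeping the first and last day on which a GPU was seen is what later lets a lifetime be treated as "still running when last observed" rather than as a failure.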

### Data Cleansing

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYhLLQX4Rq7MUeNCVLY%2F-MYkItBqEe2o0WPFL1p3%2Fimage.png?alt=media\&token=aa8bd4c8-e7ba-4fea-bf19-1cb9e1e2157f)

### Intuition for GPU Lifetimes

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYhLLQX4Rq7MUeNCVLY%2F-MYkJQtKvSeT9rAPOlUl%2Fimage.png?alt=media\&token=8ac48d51-ec4e-49d8-b7c0-4fb42b67eedb)

* Black dot: install time
* Blue square: OTB event
* Red triangle: DBE event
* Square: last seen event of the GPU

2017: failing resistor

* Had to do a hardware (GPU) swap
* Installation of the replacement GPUs

New batch: GPUs installed after 2017

Old batch: more OTB events

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYhLLQX4Rq7MUeNCVLY%2F-MYkJxJqDr6dylyGKtom%2Fimage.png?alt=media\&token=b7fa9446-6b21-480b-ab5d-35847cc1d994)

There are small gaps in the location history, but overall the GPUs are operational almost all of the time

* 4 GPUs per blade
  * Take the blade out and replace the GPU on the blade
  * The blade may be put back in a different location
* Even when only 1 GPU is affected
  * The entire blade is removed

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYhLLQX4Rq7MUeNCVLY%2F-MYkKFN52DDycLpTqenD%2Fimage.png?alt=media\&token=6100eebd-8d6d-4f3f-8fcb-ea6090456916)

* The distributions are similar
* The number of DBEs is greater than the number of OTBs

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYhLLQX4Rq7MUeNCVLY%2F-MYkKv_5-oEbq4IwfoPK%2Fimage.png?alt=media\&token=aaea374e-c4b6-4c59-86a3-732c64d57410)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYhLLQX4Rq7MUeNCVLY%2F-MYkLFTm33evW84UZAig%2Fimage.png?alt=media\&token=472f7ae7-1fcb-41fd-b13a-108e02fa8bb4)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYhLLQX4Rq7MUeNCVLY%2F-MYkMAt4lhV42TFt3MhZ%2Fimage.png?alt=media\&token=c5f428b6-b570-4085-b065-646cbd14e1e4)

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYhLLQX4Rq7MUeNCVLY%2F-MYkN6fHEAgTBBv56QBI%2Fimage.png?alt=media\&token=37d17462-b963-4e15-8f89-5a8c8937824b)

* Cage: the section of the cabinet that houses the GPU

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYhLLQX4Rq7MUeNCVLY%2F-MYkOhGLEpS3tgNkzS8b%2Fimage.png?alt=media\&token=d7f9f2d7-fc25-4f80-89c5-19d13d1ebf5f)

Values are relative to the other cages

* Cage 2 has a higher hazard ratio than cages 0 and 1
* Column level
  * Columns were next to the service system
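To make the idea of a hazard ratio between cages concrete, here is a crude sketch that assumes a constant failure rate per cage (failures divided by total time at risk) and compares each cage to a baseline cage. This is a simplification for intuition only: it is not the model fitted in the study, and the records below are invented.

```python
from collections import defaultdict

def crude_hazard_rates(records):
    """Estimate a constant hazard per group as failures / total time at risk.

    `records` is an iterable of (group, time_at_risk, failed) tuples.
    Units that never failed still contribute their observed time.
    """
    exposure = defaultdict(float)
    failures = defaultdict(int)
    for group, time_at_risk, failed in records:
        exposure[group] += time_at_risk
        failures[group] += int(failed)
    return {g: failures[g] / exposure[g] for g in exposure}

def hazard_ratios(rates, baseline):
    """Express every group's rate relative to the baseline group's rate."""
    return {g: rate / rates[baseline] for g, rate in rates.items()}

# Invented data: (cage, years observed, did the GPU fail?)
records = [
    (0, 4.0, False), (0, 5.0, True), (0, 3.0, False),
    (1, 4.0, True), (1, 4.0, False), (1, 4.0, False),
    (2, 2.0, True), (2, 3.0, True), (2, 1.0, False),
]
rates = crude_hazard_rates(records)
ratios = hazard_ratios(rates, baseline=0)
for cage in sorted(ratios):
    print(f"cage {cage}: rate={rates[cage]:.3f}/yr, ratio vs cage 0={ratios[cage]:.2f}")
```

A ratio above 1 means GPUs in that cage fail faster than those in the baseline cage; a ratio near 1 means there is no visible difference.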

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYkPKGLdH0Nb-KQU8mZ%2F-MYkPSD1Q4rwYqavEO_z%2Fimage.png?alt=media\&token=18009b56-e065-4756-b637-60d861c7cf02)

* Interleaving level from column 0 to 11

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYkPKGLdH0Nb-KQU8mZ%2F-MYkPWyn6rd3xZoAIHcj%2Fimage.png?alt=media\&token=de27bd34-f314-4402-921e-76492d8d00c2)

* Torus: the network interconnect of the supercomputer
* GPUs with a lower column number and a lower x coordinate have a higher hazard ratio than GPUs with a lower column number and a high x coordinate
  * Job scheduling impacts the overall performance
  * Workload is not balanced
    * GPUs with lower torus coordinates receive more workload
    * Why?
      * The cluster is not full (about 70% of the cluster is used)
      * Which machines get the jobs?
        * Machines at the beginning are assigned with higher probability
    * Awareness?
* Good outcome: a hazard ratio that stays consistently around 1 is the ideal scenario
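The imbalance argument above can be illustrated with a toy model. The sketch below assumes a scheduler that always places a job on the lowest-numbered free node, with one single-node job arriving per tick. The actual Titan scheduler is more complex; the node count, job length, and placement policy here are assumptions chosen so that the system runs at roughly 70% utilization.

```python
def first_fit_busy_ticks(num_nodes=10, job_length=7, ticks=1000):
    """Count busy ticks per node under lowest-index-first placement.

    One single-node job of fixed length arrives every tick. A job is
    dropped if no node is free (that never happens with these defaults).
    """
    free_at = [0] * num_nodes   # tick at which each node becomes free
    busy = [0] * num_nodes
    for now in range(ticks):
        for node in range(num_nodes):
            if free_at[node] <= now:
                free_at[node] = now + job_length
                # Only count the part of the job inside the simulated window.
                busy[node] += min(job_length, ticks - now)
                break
    return busy

busy = first_fit_busy_ticks()
utilization = sum(busy) / (10 * 1000)
print("busy ticks per node:", busy)
print(f"overall utilization: {utilization:.1%}")
```

Even though the system as a whole is only about 70% busy, the low-numbered nodes are busy nearly all the time while the highest-numbered nodes sit idle, which is the kind of uneven wear that would push hazard ratios apart.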

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MVORxAomcgtzVVUqmws%2F-MYkPKGLdH0Nb-KQU8mZ%2F-MYkQ7jqL50xDp829t9O%2Fimage.png?alt=media\&token=638c29bb-ab42-42f8-9189-cde9fbca9512)
