# RubberBand: cloud-based hyperparameter tuning

### Presentation

* Cloud: differs from the traditional datacenter environment, which has a fixed pool of resources
* HPE (hyperparameter tuning): determine the best configuration by repeatedly training candidate configurations
  * Early-stopping techniques: successive halving algorithm (SHA)
    * As trials are stopped early, the freed resources sit idle and utilization degrades
    * Or the surviving trials increase their parallelism, at the price of communication overhead
* This is not a problem in a datacenter setting, but it is not cost-effective in the cloud
* Solution: deprovision resources once they are no longer needed
  * But this is not time-efficient
  * Many ML developers are willing to trade off efficiency (imperfect scaling) for a shorter completion time
* Problem to solve: given a time constraint, how can we minimize the execution cost of a hyperparameter tuning job?
* Three **challenges**
  * **How can we model the job completion time and cost of a given allocation plan?**
    * Contributing factors to job completion time: latency, scalability, cloud provider overheads, overall computation structure of the job
    * Contributing factors to execution cost: price of each resource provisioned, price of data movement, provider billing model
  * **How can we generate a low-cost allocation plan that completes on time?**
  * **How can we schedule said allocation to optimize worker co-location and cluster utilization?**
* **Solution**
  * Cost/performance model built by profiling DL model training latency and provisioning overhead
  * DAG-based execution model that finds feasible and cost-efficient resource allocations
  * Full-stack system for placement, scheduling, and scaling
* **Issue 1: modeling job completion time and cost**
  * Profiling data plus a simulator that takes stragglers into account
    * Performance modeling: training latency, provider queueing delay, instance initialization latency
    * Cost modeling: compute price, billing granularity, data price
  * Offline profiling: takes about 5 minutes
  * Simulator: evaluates each potential execution plan
    * Simulates the plan as a DAG
    * ![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2F1G9KlMOCvTFjJTsNxstO%2Fimage.png?alt=media\&token=2e66d2a5-f31e-4e0b-b140-d8751c1cf731)
    * Successive halving algorithm with 3 different stages
      * Account for stragglers: sample the training latency for each node type from a distribution parameterized by the profiling data
      * Question: how is the straggler issue itself solved?
* **Issue 2: finding a low cost allocation plan**
  * Planner: starts with a feasible plan and then iteratively improves it
  * Steps
    * Generate candidates
    * Use simulator to predict job completion time
    * Greedily select best candidate
    * Iterate with new best candidate
  * Maximize the cost-marginal benefit
    * M = (cost of current best plan - cost of proposed plan) / (JCT of proposed plan - JCT of current best plan)
    * Goal: reduce cost without significantly increasing the job completion time (JCT)
  * Question: is this portable to all kinds of HPE algorithms? Accuracy is also important
* **Issue 3: effectively execute allocation plan**
  * RubberBand executor: scheduler, cluster manager, placement controller
  * Scheduler requests the cluster manager to provision new nodes or deprovision existing ones
  * Placement controller converts the resource quantity allocated to each trial into a physical resource assignment
    * Place the parallel workers of a trial onto a single machine (or pack them into a minimal set of nodes)
    * Co-located workers avoid network overhead
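
The co-location goal of the placement controller can be illustrated with a small packing sketch. This is not RubberBand's actual placement algorithm: the function name `place_trials`, the first-fit-decreasing heuristic, and the uniform `gpus_per_node` capacity are all assumptions made for illustration.

```python
from typing import Dict, List


def place_trials(gpu_demand: Dict[str, int], num_nodes: int,
                 gpus_per_node: int) -> Dict[str, Dict[int, int]]:
    """Map each trial's GPU count to nodes, keeping workers co-located.

    Hypothetical helper. Returns {trial: {node_index: gpus_on_that_node}}.
    """
    free: List[int] = [gpus_per_node] * num_nodes
    if sum(gpu_demand.values()) > sum(free):
        raise ValueError("not enough GPUs in the cluster")

    placement: Dict[str, Dict[int, int]] = {}
    # Place the largest trials first (first-fit decreasing, an assumed heuristic).
    for trial, need in sorted(gpu_demand.items(), key=lambda kv: -kv[1]):
        # Prefer the tightest single node that can hold the whole trial.
        fitting = [n for n in range(num_nodes) if free[n] >= need]
        if fitting:
            node = min(fitting, key=lambda n: free[n])
            free[node] -= need
            placement[trial] = {node: need}
            continue
        # Otherwise span as few nodes as possible: take the emptiest nodes first.
        assignment: Dict[int, int] = {}
        for node in sorted(range(num_nodes), key=lambda n: -free[n]):
            if need == 0:
                break
            take = min(free[node], need)
            if take > 0:
                assignment[node] = take
                free[node] -= take
                need -= take
        placement[trial] = assignment
    return placement
```

For example, with two 4-GPU nodes, trials needing 4, 2, and 2 GPUs each land on a single node, so none of them pays cross-node communication cost.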

![](https://2097630930-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MVORxAomcgtzVVUqmws%2Fuploads%2F5etAxybsdfMTnib9EY7Q%2Fimage.png?alt=media\&token=2250dcfb-b2e2-4623-b72b-78090db9c1cf)
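
The simulator idea from Issue 1 can be sketched as a small Monte Carlo estimate. This is an illustrative simplification, not RubberBand's simulator: the `Stage` fields, the normal latency distribution, the one-trial-per-node assumption, the per-second billing default, and every number are assumptions. Each SHA stage is treated as one level of the DAG, and a wave of trials finishes only when its slowest (straggler) trial finishes.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stage:
    num_trials: int        # trials still alive in this SHA stage
    iters_per_trial: int   # training iterations each trial runs in this stage
    num_nodes: int         # nodes provisioned for this stage (hypothetical plan field)


def simulate_plan(stages: List[Stage], mean_iter_s: float, std_iter_s: float,
                  price_per_node_hour: float, provision_s: float = 0.0,
                  billing_unit_s: int = 1, samples: int = 200,
                  seed: int = 0) -> Tuple[float, float]:
    """Return (mean job completion time in seconds, mean cost) over sampled runs."""
    rng = random.Random(seed)
    total_jct = total_cost = 0.0
    for _ in range(samples):
        jct = cost = 0.0
        for st in stages:
            # Assumption: one trial per node, so trials run in waves.
            waves = math.ceil(st.num_trials / st.num_nodes)
            stage_s = provision_s
            for w in range(waves):
                in_wave = min(st.num_nodes, st.num_trials - w * st.num_nodes)
                # The slowest sampled trial (the straggler) bounds the wave.
                stage_s += max(
                    st.iters_per_trial * max(rng.gauss(mean_iter_s, std_iter_s), 0.0)
                    for _ in range(in_wave)
                )
            # Round the stage up to the provider's billing granularity.
            billed_s = math.ceil(stage_s / billing_unit_s) * billing_unit_s
            jct += stage_s
            cost += st.num_nodes * billed_s / 3600.0 * price_per_node_hour
        total_jct += jct
        total_cost += cost
    return total_jct / samples, total_cost / samples
```

In this sketch `mean_iter_s` and `std_iter_s` stand in for the values the offline profiling step would produce; a coarser `billing_unit_s` (for example 60) shows how billing granularity inflates the cost of short stages.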

#### System

1. Profiler and simulator model the job completion time and cost of potential allocations
2. Planner generates a low-cost allocation plan that completes on time
3. Scheduler, placement controller, and cluster manager execute the allocation plan so that worker co-location and cluster utilization are maximized
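
Step 2 can be sketched as a greedy loop over candidate plans, ranked by the cost-marginal benefit M from the notes on Issue 2. This is a hedged illustration, not the paper's planner: the plan representation (a list of node counts per stage), the candidate generator (remove one node from one stage), and the `toy_simulate` stand-in for the simulator are all assumptions.

```python
from typing import Callable, List, Tuple

Plan = List[int]  # hypothetical representation: nodes allocated to each stage


def greedy_plan(initial: Plan, simulate: Callable[[Plan], Tuple[float, float]],
                deadline_s: float) -> Plan:
    """Iteratively shrink a feasible plan while it still meets the deadline."""
    best = list(initial)
    best_jct, best_cost = simulate(best)
    if best_jct > deadline_s:
        raise ValueError("initial plan must be feasible")
    while True:
        chosen = None
        chosen_score = 0.0
        # Generate candidates: remove one node from one stage (assumed move set).
        for i, nodes in enumerate(best):
            if nodes <= 1:
                continue
            cand = best[:i] + [nodes - 1] + best[i + 1:]
            jct, cost = simulate(cand)
            saving = best_cost - cost
            if jct > deadline_s or saving <= 0:
                continue  # misses the deadline or is not cheaper
            slowdown = jct - best_jct
            # M = cost saved per second of added JCT; free savings rank highest.
            score = float("inf") if slowdown <= 0 else saving / slowdown
            if chosen is None or score > chosen_score:
                chosen, chosen_score = (cand, jct, cost), score
        if chosen is None:
            return best  # no candidate lowers cost within the deadline
        best, best_jct, best_cost = chosen


def toy_simulate(plan: Plan) -> Tuple[float, float]:
    """Toy stand-in for the simulator: returns (JCT seconds, cost), made-up numbers."""
    stages = [(10.0, 120.0), (10.0, 60.0)]  # (serial s, parallelizable s) per stage
    jct = cost = 0.0
    for (serial, parallel), n in zip(stages, plan):
        t = serial + parallel / n  # imperfect scaling: the serial part never shrinks
        jct += t
        cost += n * t              # assumed price of 1 per node-second
    return jct, cost
```

With a 100-second deadline, `greedy_plan([4, 4], toy_simulate, 100.0)` returns `[3, 2]`: each round drops the node whose removal saves the most cost per second of added completion time, and the loop stops once any further removal would miss the deadline.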
