# RubberBand: cloud-based hyperparameter tuning

### Presentation

* Cloud: differs from the traditional datacenter environment, which has a fixed pool of resources
* HPE: determine the optimal configuration by repeatedly training candidate configs
  * Early stopping techniques: successive halving algorithm (SHA)
    * Resource utilization degrades as trials are stopped (idle resources)
    * Or increase per-trial parallelism to use the freed resources (but with communication overhead)
* Idle resources are not a problem in a datacenter setting, but they are not cost-effective in the cloud
* Solution: deprovision resources once they are no longer needed
  * But this is not time-effective
  * Many ML developers are willing to trade off scaling efficiency for a shorter completion time
* Problem: given a time constraint, how can we minimize the execution cost of a hyperparameter tuning job?
* Three **challenges**
  * **How can we model the job completion time and cost of a given allocation plan?**
    * Contributing factors to job completion time: latency, scalability, cloud provider overheads, overall computation structure of the job
    * Contributing factors to execution cost: price of each resource provisioned, price of data movement, provider billing model
  * **How can we generate a low-cost allocation plan that completes on time?**
  * **How can we schedule said allocation to optimize worker co-location and cluster utilization?**
* **Solution**
  * Cost / performance model via profiling DL model training latency and provisioning overhead
  * DAG-based execution model which finds feasible and cost-efficient resource allocations
  * Full-stack system for placement, scheduling, and scaling
* **Issue 1: modeling job completion time and cost**
  * Profiling data plus a simulator, taking stragglers into account
    * Performance modeling: training latency, provider queueing delay, instance initialization latency
    * Cost modeling: compute price, billing granularity, data price
  * Offline profiling: takes 5 minutes or so
  * Simulator: each potential execution plan
    * Simulate a DAG
    * ![](/files/M10BUaAcE1QScVfQLbQ8)
    * Example: successive halving algorithm with 3 stages
      * Account for stragglers: sample the training latency for each node type from a distribution parameterized by the profiling data
      * Open question: how to solve the straggler issue itself?
* **Issue 2: finding a low cost allocation plan**
  * Planner: start with a feasible solution, then iteratively improve the plan
  * Steps
    * Generate candidates
    * Use simulator to predict job completion time
    * Greedily select best candidate
    * Iterate with new best candidate
  * Maximize the cost-marginal benefit
    * M = (cost of current best plan - cost of proposed plan) / (JCT of proposed plan - JCT of current best plan)
    * Goal: reduce cost without significantly increasing the job completion time (JCT)
  * Question: is it portable to all kinds of HPE algorithms? Accuracy is also important.
* **Issue 3: effectively execute allocation plan**
  * RubberBand executor: scheduler, cluster manager, placement controller
  * Scheduler requests the cluster manager to provision new nodes or deprovision existing ones
  * Placement controller converts the resource quantity allocated to each trial into a physical resource assignment
    * Place the parallel workers of a trial onto a single machine (or pack them into a minimal set of nodes)
    * Co-located workers avoid network overhead
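
The simulator idea from Issue 1 can be illustrated with a small Monte Carlo sketch. This is not RubberBand's simulator: the function name `simulate_jct`, the normal latency distribution, the one-GPU-per-trial assumption, and the "wave" execution model are all assumptions made for illustration. It only shows how sampling per-trial latency makes the slowest trial (the straggler) determine when a stage finishes.

```python
import random

def simulate_jct(trials, iters, gpus, mean_latency, stddev, samples=200, seed=0):
    """Estimate job completion time (seconds) of a staged SHA job.

    trials[i], iters[i], gpus[i]: number of trials, iterations per trial,
    and GPUs allocated in stage i (hypothetical model: one GPU per trial).
    mean_latency, stddev: per-iteration latency, as if taken from profiling.
    """
    if any(g < 1 for g in gpus):
        raise ValueError("every stage needs at least one GPU")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        for n_trials, n_iters, n_gpus in zip(trials, iters, gpus):
            remaining = n_trials
            while remaining > 0:
                # Trials run in waves of at most n_gpus at a time.
                wave = min(remaining, n_gpus)
                # A wave ends when its slowest trial (the straggler) ends.
                total += max(
                    n_iters * max(rng.gauss(mean_latency, stddev), 0.0)
                    for _ in range(wave)
                )
                remaining -= wave
    return total / samples

# Three SHA stages: 8 -> 4 -> 2 trials, with more iterations per survivor.
no_noise = simulate_jct((8, 4, 2), (1, 2, 4), (8, 4, 2), 10.0, 0.0)
with_stragglers = simulate_jct((8, 4, 2), (1, 2, 4), (8, 4, 2), 10.0, 2.0)
print(no_noise, with_stragglers)
```

With zero variance the estimate reduces to the sum of the stage times; with variance, the estimate grows because each wave waits for its slowest trial.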

![](/files/XYM5cPnBBRmkOGfSyq9k)
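
The planner loop from Issue 2 (generate candidates, predict JCT, greedily pick the candidate with the highest cost-marginal benefit M, iterate) can be sketched as follows. Everything here is a simplified assumption rather than the paper's implementation: the `ITER_LATENCY` table stands in for profiling data, `estimate` stands in for the simulator, a plan is just "GPUs per trial in each stage", and candidates are generated by stepping one stage down to the next smaller GPU count.

```python
import math

# Hypothetical profiled seconds per iteration for a trial using g GPUs
# (sub-linear scaling: 4 GPUs are 2.5x faster than 1 GPU, not 4x).
ITER_LATENCY = {1: 100.0, 2: 62.5, 4: 40.0}
GPU_OPTIONS = sorted(ITER_LATENCY)

def estimate(plan, trials, iters, price=1.0):
    """Stand-in for the simulator: return (jct, cost) for a plan.

    plan[i] is the number of GPUs given to each trial in stage i;
    price is in arbitrary cost units per GPU-second.
    """
    jct = cost = 0.0
    for gpus, n_trials, n_iters in zip(plan, trials, iters):
        stage_time = n_iters * ITER_LATENCY[gpus]  # trials run in parallel
        jct += stage_time
        cost += n_trials * gpus * stage_time * price
    return jct, cost

def marginal_benefit(current, proposed):
    """M = (cost saved) / (JCT added); infinite if the plan is not slower."""
    (cur_jct, cur_cost), (new_jct, new_cost) = current, proposed
    slowdown = new_jct - cur_jct
    return math.inf if slowdown <= 0 else (cur_cost - new_cost) / slowdown

def plan_greedy(trials, iters, deadline):
    # Start from the fastest plan; it must be feasible.
    best = tuple(GPU_OPTIONS[-1] for _ in trials)
    best_est = estimate(best, trials, iters)
    if best_est[0] > deadline:
        raise ValueError("even the fastest plan misses the deadline")
    while True:
        scored = []
        for i, gpus in enumerate(best):
            idx = GPU_OPTIONS.index(gpus)
            if idx == 0:
                continue  # this stage is already at the smallest allocation
            cand = best[:i] + (GPU_OPTIONS[idx - 1],) + best[i + 1:]
            est = estimate(cand, trials, iters)
            # Keep only candidates that meet the deadline and cost less.
            if est[0] <= deadline and est[1] < best_est[1]:
                scored.append((marginal_benefit(best_est, est), cand, est))
        if not scored:
            return best, best_est
        _, best, best_est = max(scored, key=lambda s: s[0])

plan, (jct, cost) = plan_greedy(trials=(8, 4, 2), iters=(1, 2, 4), deadline=400)
print(plan, jct, cost)
```

In this toy setup the loop shrinks the allocation of the early, short stages first, since that saves the most cost per second of added JCT, and stops once any further reduction would miss the deadline.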

#### System

1. Profiler and simulator model the job completion time and cost of potential allocations
2. Planner generates a low-cost allocation plan that completes on time
3. Scheduler, placement controller, and cluster manager execute the allocation plan such that worker co-location and cluster utilization are maximized
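
The placement step (keep a trial's workers on a single machine where possible, otherwise pack them into as few nodes as possible) can be sketched with a simple bin-packing heuristic. The heuristic below (largest trials first, best-fit onto a single node, then spill across the emptiest nodes) is an illustrative choice, not necessarily what RubberBand's placement controller does; `place_trials` and its parameters are hypothetical names.

```python
def place_trials(gpus_per_trial, node_capacity, num_nodes):
    """Map each trial to {node_id: gpus}, preferring a single node per trial.

    gpus_per_trial: dict of trial id -> number of GPUs (workers) it needs.
    """
    free = [node_capacity] * num_nodes
    placement = {}
    # Place the largest trials first so they are least likely to be split.
    for trial, need in sorted(gpus_per_trial.items(), key=lambda kv: -kv[1]):
        # Best fit: the fullest node that can still host the whole trial.
        fits = [n for n in range(num_nodes) if free[n] >= need]
        if fits:
            node = min(fits, key=lambda n: free[n])
            free[node] -= need
            placement[trial] = {node: need}
            continue
        if sum(free) < need:
            raise ValueError(f"not enough free GPUs for trial {trial!r}")
        # Otherwise spill over, taking the emptiest nodes first so the
        # trial spans as few nodes as possible.
        assigned = {}
        for node in sorted(range(num_nodes), key=lambda n: -free[n]):
            if need == 0:
                break
            take = min(free[node], need)
            if take:
                assigned[node] = take
                free[node] -= take
                need -= take
        placement[trial] = assigned
    return placement

print(place_trials({"a": 4, "b": 2, "c": 2, "d": 3}, node_capacity=4, num_nodes=3))
```

Each trial in the example fits on one node, so no trial pays cross-node communication overhead; a trial that needs more GPUs than any single node has free is split across the fewest nodes this heuristic can find.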

