RubberBand: cloud-based hyperparameter tuning

https://dl.acm.org/doi/10.1145/3447786.3456245

Presentation

  • Cloud: different from the traditional datacenter environment (with its fixed pool of resources)

  • HPE (hyperparameter exploration): determine the optimal configuration by repeatedly training candidate configurations

    • Early stopping techniques: the SHA (successive halving) algorithm

      • Degrades resource utilization (stopped trials leave resources idle)

      • Or increase the parallelism of the surviving trials (but this adds communication overhead)
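To make the idle-resource point concrete, here is a minimal sketch of successive halving (SHA): train all survivors for one rung, keep the top fraction, repeat. `train_and_eval` is a hypothetical stand-in for real training, not the paper's code.

```python
import random

def successive_halving(configs, rungs=3, keep_frac=0.5, seed=0):
    """Sketch of SHA over a set of hyperparameter configs."""
    rng = random.Random(seed)

    def train_and_eval(config):
        # placeholder score; a real system would train `config` for the
        # rung's budget and return validation accuracy
        return rng.random()

    survivors = list(configs)
    for _ in range(rungs):
        scores = {c: train_and_eval(c) for c in survivors}
        k = max(1, int(len(survivors) * keep_frac))
        # stopped trials free their workers, which otherwise sit idle
        # unless the cluster is rescaled or the trials are re-parallelized
        survivors = sorted(survivors, key=scores.get, reverse=True)[:k]
    return survivors
```

With 8 configs and 3 rungs at keep_frac=0.5, the pool shrinks 8 → 4 → 2 → 1, so by the last rung most of a fixed-size cluster is idle.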

  • Not a problem in a datacenter setting, but not cost-effective in the cloud

  • Solution: deprovision resources as they are no longer needed

    • But not time effective

    • Many ML developers are willing to trade off efficiency (imperfect scaling) for a shorter completion time

  • Problem: given a time constraint, how can we minimize the execution cost of a hyperparameter tuning job?

  • Three challenges

    • How can we model the job completion time and cost of a given allocation plan?

      • Contributing factors to job completion time: latency, scalability, cloud provider overheads, overall computation structure of the job

      • Contributing factors to execution cost: price of each resource provisioned, price of data movement, provider billing model
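A rough sketch of how the cost factors above could combine: per-node compute time rounded up to the provider's billing granularity, plus a data-movement term. All parameter names here are illustrative, not the paper's model.

```python
import math

def plan_cost(node_seconds, price_per_hour, billing_granularity_s,
              data_gb=0.0, price_per_gb=0.0):
    """Toy execution-cost model for a provisioning plan.

    node_seconds: seconds each provisioned node was held.
    """
    compute = 0.0
    for seconds in node_seconds:
        # providers bill in fixed increments, so round up per node
        billed = math.ceil(seconds / billing_granularity_s) * billing_granularity_s
        compute += billed / 3600.0 * price_per_hour
    return compute + data_gb * price_per_gb
```

The billing-granularity rounding is why short-lived instances can be disproportionately expensive.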

    • How can we generate a low-cost allocation plan that completes on time?

    • How can we schedule said allocation to optimize worker co-location + cluster utilization?

  • Solution

    • Cost / performance model via profiling DL model training latency and provisioning overhead

    • DAG-based execution model that finds feasible and cost-efficient resource allocations

    • Full-stack system for placement, scheduling, and scaling

  • Issue 1: modeling job completion time and cost

    • Profiling data plus a simulator, taking stragglers into account

      • Performance modeling: training latency, provider queueing delay, instance initialization latency

      • Cost modeling: compute price, billing granularity, data price

    • Offline profiling: 5 mins or so

    • Simulator: evaluates each potential execution plan

      • Simulate a DAG

      • Successive halving algorithm with 3 different stages

        • Account for stragglers: sample the training latency for each node type from a distribution parameterized by the profiling data

        • How to solve the straggler issue?
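The straggler sampling above can be sketched as follows, assuming (as a simplification) a normal latency distribution whose mean/std come from the offline profiling phase; in synchronous training a step finishes only when the slowest worker does.

```python
import random

def simulate_step_latency(num_workers, mean, std, rng):
    # per-step latency is the max over workers, so stragglers dominate
    samples = [max(0.0, rng.gauss(mean, std)) for _ in range(num_workers)]
    return max(samples)

def simulate_trial_latency(num_steps, num_workers, mean, std, seed=0):
    # mean/std per node type would be fit from the ~5 min profiling data
    rng = random.Random(seed)
    return sum(simulate_step_latency(num_workers, mean, std, rng)
               for _ in range(num_steps))
```

Taking the max over sampled workers is what makes predicted latency grow with parallelism even at a fixed per-worker mean.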

  • Issue 2: finding a low cost allocation plan

    • Planner: start with a feasible solution, then iteratively improve the plan

    • Steps

      • Generate candidates

      • Use simulator to predict job completion time

      • Greedily select best candidate

      • Iterate with new best candidate

    • Maximize cost-marginal benefit

      • M = (cost of current best plan - cost of proposed plan) / (JCT of proposed plan - JCT of current best plan)

      • Reduce cost without significantly increasing the job completion time

    • Question: is this portable to all kinds of HPE algorithms? Accuracy is also important
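One greedy step of the planner loop above might look like this sketch, where candidate plans are reduced to hypothetical (cost, JCT) pairs; in the paper the candidates and their predicted JCTs come from the DAG simulator.

```python
def pick_next_plan(best, candidates, deadline):
    """Pick the feasible candidate maximizing cost-marginal benefit
    M = (cost_best - cost_cand) / (jct_cand - jct_best)."""
    best_cost, best_jct = best
    best_m, chosen = 0.0, None
    for cost, jct in candidates:
        # feasibility: must meet the deadline and actually reduce cost
        if jct > deadline or cost >= best_cost:
            continue
        dt = jct - best_jct
        # a cheaper plan that is also no slower strictly dominates
        m = float("inf") if dt <= 0 else (best_cost - cost) / dt
        if m > best_m:
            best_m, chosen = m, (cost, jct)
    return chosen or best
```

Iterating this step realizes "reduce cost without significantly increasing job completion time": each step buys the most cost savings per second of added JCT.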

  • Issue 3: effectively execute allocation plan

    • RubberBand executor: scheduler, cluster manager, placement controller

    • Scheduler requests cluster manager to provision new nodes or de-provision existing ones

    • Placement controller converts the resource quantity allocated to each trial to physical resource assignment

      • Place parallel workers of a trial onto a single machine (or pack them into a minimal set of nodes)

      • Co-locating workers avoids network overhead
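The placement step can be sketched as a greedy packing: fit each trial's workers on one node when possible, otherwise open a minimal set of new nodes. The data shapes (trial-id → worker count, GPUs per node) are illustrative assumptions, not the paper's interfaces.

```python
def pack_workers(trials, gpus_per_node):
    """Greedily pack trials' workers, preferring single-node placement
    so co-located workers avoid network overhead."""
    placement, nodes = {}, []  # nodes: free GPU slots per node
    # place the largest trials first
    for trial, workers in sorted(trials.items(), key=lambda kv: -kv[1]):
        assigned, remaining = [], workers
        # try to fit the whole trial on one existing node first
        for i, free in enumerate(nodes):
            if free >= remaining:
                nodes[i] -= remaining
                assigned.append((i, remaining))
                remaining = 0
                break
        # otherwise open new nodes until all workers are placed
        while remaining > 0:
            take = min(remaining, gpus_per_node)
            nodes.append(gpus_per_node - take)
            assigned.append((len(nodes) - 1, take))
            remaining -= take
        placement[trial] = assigned
    return placement
```

Placing large trials first reduces fragmentation, which keeps cluster utilization high while still preferring co-location.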

System

  1. Profiler and simulator model the job completion time + cost of potential allocations

  2. Planner generates a low-cost allocation plan that completes on time

  3. Scheduler, placement controller, and cluster manager execute the allocation plan such that worker co-location and cluster utilization are maximized
