RubberBand: cloud-based hyperparameter tuning
https://dl.acm.org/doi/10.1145/3447786.3456245
Cloud: different from the traditional datacenter environment (fixed pool of resources)
HPE (hyperparameter exploration): determine the optimal configuration by repeatedly training candidate configs
Early stopping techniques: SHA algorithm
As trials are stopped early, resources sit idle (degraded utilization)
Or increase parallelism for surviving trials (but communication overhead)
Not a problem in a datacenter setting, but not cost-effective in the cloud
Option: deprovision resources once they are no longer needed
But not time-effective
Many ML developers are willing to trade off efficiency (imperfect scaling) for a shorter completion time
Solve: given a time constraint, how can we minimize the execution cost of a hyperparameter tuning job
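The early-stopping idea behind SHA can be sketched as below. This is a toy illustration, not RubberBand's implementation; the `evaluate` stub, the keep-fraction `eta`, and the noise model are all assumptions:

```python
import random

def evaluate(config, budget):
    # Stand-in for "train `config` for `budget` epochs and score it":
    # here the "accuracy" is the config value plus noise that shrinks
    # as the training budget grows.
    return config + random.gauss(0, 1.0 / budget)

def successive_halving(configs, rungs=3, eta=2, budget=1):
    """Toy SHA: train all configs briefly, keep the best 1/eta
    fraction, multiply the budget by eta, and repeat."""
    survivors = list(configs)
    for _ in range(rungs):
        scores = {c: evaluate(c, budget) for c in survivors}
        survivors.sort(key=lambda c: scores[c], reverse=True)
        survivors = survivors[:max(1, len(survivors) // eta)]
        budget *= eta  # surviving configs earn more training time
    return survivors[0]
```

With 8 configs, `rungs=3`, and `eta=2`, the survivor counts go 8 → 4 → 2 → 1, which matches the "3 different stages" structure noted below.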
Three challenges
How can we model the job completion time and cost of the given allocation plan
Contributing factors to job completion time: latency, scalability, cloud provider overheads, overall computation structure of the job
Contributing factors to execution cost: price of each resource provisioned, price of data movement, provider billing model
How can we generate a low cost allocation plan that completes on time
How can we schedule said allocation to optimize worker co-location + cluster utilization?
Solution
Cost / performance model via profiling DL model training latency and provisioning overhead
DAG-based execution model which finds feasible and cost-efficient resource allocations
Full-stack system for placement, scheduling, and scaling
Issue 1: modeling job completion time and cost
Approach: profiling data plus a simulator, taking stragglers into account
Performance modeling: training latency, provider queueing delay, instance initialization latency
Cost modeling: compute price, billing granularity, data transfer price
Offline profiling: 5 mins or so
Simulator: simulates each potential execution plan as a DAG
Successive halving algorithm with 3 different stages
How to handle stragglers? Sample the training latency for each node type from a distribution parameterized by the profiling data
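A minimal sketch of the straggler-aware simulation idea: per-step latency on each worker is drawn from a distribution fitted to profiling data, and each step finishes only when the slowest worker does. The normal distribution and all parameters here are assumptions for illustration, not the paper's fitted model:

```python
import random

def simulate_stage_time(num_workers, steps, mean_latency, std_latency):
    """Simulate one training stage: every step is gated by the
    slowest (straggler) worker, whose latency is sampled per step."""
    total = 0.0
    for _ in range(steps):
        latencies = [max(0.0, random.gauss(mean_latency, std_latency))
                     for _ in range(num_workers)]
        total += max(latencies)  # stage advances at the straggler's pace
    return total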
Issue 2: finding a low cost allocation plan
Planner: start with a feasible solution and then the plan is iteratively improved upon
Steps
Generate candidates
Use simulator to predict job completion time
Greedily select best candidate
Iterate with new best candidate
Maximize cost-marginal benefit
M = (cost of current best plan - cost of proposed plan) / (JCT of proposed plan - JCT of current best plan)
Reduce cost without significantly increasing the job completion time
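The greedy step above can be sketched as follows, using the M formula as written. The plan representation (dicts with simulated `cost` and `jct`) is a stand-in for the simulator's predictions, not RubberBand's actual data structures:

```python
def pick_next_plan(current, candidates, deadline):
    """One greedy iteration: among candidate plans that still meet
    the deadline, pick the one with the highest cost-marginal
    benefit M = (cost saved) / (JCT increase)."""
    best, best_m = None, 0.0
    for plan in candidates:
        if plan["jct"] > deadline:
            continue  # infeasible: misses the time constraint
        saved = current["cost"] - plan["cost"]
        slower = plan["jct"] - current["jct"]
        if saved <= 0:
            continue  # no cost reduction, not worth considering
        m = saved / slower if slower > 0 else float("inf")
        if m > best_m:
            best, best_m = plan, m
    return best or current  # keep the current plan if nothing improves it
```

Iterating this until no candidate improves M yields a plan that reduces cost without significantly increasing job completion time.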
Question: is this portable to all kinds of HPE algorithms? Accuracy also matters
Issue 3: effectively execute allocation plan
RubberBand executor: scheduler, cluster manager, placement controller
Scheduler requests cluster manager to provision new nodes or de-provision existing ones
Placement controller converts the resource quantity allocated to each trial to physical resource assignment
Place parallel workers of a trial onto a single machine (or pack them into a minimal set of nodes)
Co-located workers avoid network overhead
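The packing idea can be sketched as a first-fit placement: put a trial's workers on as few nodes as possible so co-located workers talk over local interconnect rather than the network. The GPU-count interface is an assumption for illustration:

```python
def pack_workers(num_workers, gpus_per_worker, gpus_per_node):
    """First-fit packing of a trial's parallel workers onto nodes.
    Returns a list of per-node worker counts; a new node is
    provisioned only when no existing node has room."""
    free = []       # remaining GPUs on each provisioned node
    placement = []  # workers assigned to each node
    for _ in range(num_workers):
        for i, slots in enumerate(free):
            if slots >= gpus_per_worker:
                free[i] -= gpus_per_worker
                placement[i] += 1
                break
        else:  # no node fits this worker: provision one more
            free.append(gpus_per_node - gpus_per_worker)
            placement.append(1)
    return placement
```

For example, four 2-GPU workers fit on one 8-GPU node, while a fifth forces a second node, so the trial's communication stays mostly intra-node.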
Profiler and simulator model job completion time + cost of potential allocations
Planner generates a low cost allocation plan that completes on time
Scheduler, placement controller, and cluster manager execute the allocation plan such that worker co-location and cluster utilization are maximized