RubberBand: cloud-based hyperparameter tuning


  • Cloud: differs from the traditional datacenter environment (which has a fixed pool of resources)

  • HPE: determine the optimal configuration by repeatedly training configs

    • Early stopping techniques: e.g. the successive halving algorithm (SHA)

      • Resource utilization degrades as trials are stopped (idle resources)

      • Or increase the parallelism of the surviving trials (but this adds communication overhead)
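To make the idle-resource point concrete, here is a minimal sketch of how a successive-halving schedule leaves a fixed pool underused. The function names, the reduction factor `eta`, and the one-worker-per-trial assumption are illustrative and not taken from the paper.

```python
def sha_schedule(num_trials: int, eta: int = 2) -> list[int]:
    """Number of surviving trials at each stage (hypothetical helper)."""
    stages = []
    n = num_trials
    while True:
        stages.append(n)
        if n == 1:
            break
        # Keep roughly the top 1/eta of the trials, never fewer than one.
        n = max(1, n // eta)
    return stages


def idle_fraction(stages: list[int], pool_size: int) -> list[float]:
    """Fraction of a fixed pool left idle per stage, assuming one worker per trial."""
    return [1 - min(n, pool_size) / pool_size for n in stages]


stages = sha_schedule(8, eta=2)
print(stages)
print(idle_fraction(stages, pool_size=8))
```

With a fixed pool sized for the first stage, every later stage leaves a growing share of the pool idle unless the surviving trials are scaled out.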

  • Idle resources are not a problem in a datacenter setting, but they are not cost effective in the cloud

  • Solution: deprovision resources once they are no longer needed

    • But this is not time effective

    • Many ML developers are willing to trade off efficiency (imperfect scaling) for a shorter completion time

  • Problem: given a time constraint, how can we minimize the execution cost of a hyperparameter tuning job?

  • Three challenges

    • How can we model the job completion time and cost of a given allocation plan?

      • Contributing factors to job completion time: latency, scalability, cloud provider overheads, overall computation structure of the job

      • Contributing factors to execution cost: price of each resource provisioned, price of data movement, provider billing model

    • How can we generate a low-cost allocation plan that completes on time?

    • How can we schedule said allocation to optimize worker co-location + cluster utilization?

  • Solution

    • Cost / performance model via profiling DL model training latency and provisioning overhead

    • DAG-based execution model which finds feasible and cost-efficient resource allocations

    • Full-stack system for placement, scheduling, and scaling

  • Issue 1: modeling job completion time and cost

    • Profiling data plus a simulator that takes stragglers into account

      • Performance modeling: training latency, provider queueing delay, instance initialization latency

      • Cost modeling: compute price, billing granularity, data price
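The cost-modeling factors above can be sketched as a small function. This is only an illustration: the parameter names, the rounding rule, and the minimum billed duration are assumptions, and real provider billing models differ.

```python
import math


def compute_cost(runtime_s: float, price_per_hour: float,
                 billing_granularity_s: int = 1, min_billed_s: int = 60) -> float:
    """Cost of one instance, rounding runtime up to the billing granularity.

    All parameters are hypothetical; check the provider's actual billing model.
    """
    billed_s = math.ceil(runtime_s / billing_granularity_s) * billing_granularity_s
    billed_s = max(billed_s, min_billed_s)
    return billed_s / 3600 * price_per_hour


def plan_cost(instance_runtimes_s: list[float], price_per_hour: float,
              data_gb: float = 0.0, price_per_gb: float = 0.0, **billing) -> float:
    """Total cost of a plan: compute cost of every instance plus data movement."""
    compute = sum(compute_cost(r, price_per_hour, **billing) for r in instance_runtimes_s)
    return compute + data_gb * price_per_gb


# Per-second vs per-hour billing for the same 90-second instance.
print(compute_cost(90, 3.6))
print(compute_cost(90, 3.6, billing_granularity_s=3600, min_billed_s=0))
```

Coarser billing granularity makes short-lived instances disproportionately expensive, which is why the granularity belongs in the cost model alongside the raw price.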

    • Offline profiling: 5 mins or so

    • Simulator: each potential execution plan

      • Simulate a DAG

      • Successive halving algorithm with 3 different stages

        • Account for stragglers: sample the training latency for each node type from a distribution parameterized by the profiling data

        • How to solve the straggler issue?
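A rough sketch of the straggler-aware simulation idea follows. It assumes each stage ends with a barrier (so the slowest trial determines the stage time) and that latencies are Gaussian with a profiled mean and standard deviation; both are assumptions made for illustration, and the stage tuples are made-up numbers.

```python
import random


def simulate_jct(stages, rng: random.Random, num_samples: int = 1000) -> float:
    """Monte Carlo estimate of job completion time.

    stages: list of (num_trials, mean_latency_s, std_latency_s), where the
    mean and std would come from offline profiling (hypothetical format).
    """
    total = 0.0
    for _ in range(num_samples):
        jct = 0.0
        for num_trials, mean, std in stages:
            # Sample one latency per trial; clamp at zero to avoid negative draws.
            latencies = [max(0.0, rng.gauss(mean, std)) for _ in range(num_trials)]
            # Assumed barrier: the stage finishes when its slowest trial finishes.
            jct += max(latencies)
        total += jct
    return total / num_samples


stages = [(8, 100.0, 10.0), (4, 50.0, 5.0), (2, 25.0, 2.5)]
print(simulate_jct(stages, random.Random(0)))
```

Because each stage takes the maximum over its trials, the estimate sits above the sum of mean latencies, and the gap grows with the number of trials per stage.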

  • Issue 2: finding a low cost allocation plan

    • Planner: start with a feasible solution and then the plan is iteratively improved upon

    • Steps

      • Generate candidates

      • Use simulator to predict job completion time

      • Greedily select best candidate

      • Iterate with new best candidate

    • Maximize cost-marginal benefit

      • M = (cost of current best plan - cost of proposed plan) / (JCT of proposed plan - JCT of current best plan)

      • Reduce cost without significantly increasing the job completion time
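The planner loop and the marginal-benefit ratio M can be sketched as a greedy search. Everything here is a toy: `evaluate` stands in for the simulator with an assumed sublinear scaling law, a plan is just a tuple of worker counts per stage, and candidates are generated by removing one worker from one stage.

```python
def evaluate(plan, work=(800.0, 400.0, 200.0), price=1.0, alpha=0.8):
    """Toy stand-in for the simulator: returns (JCT, cost) of a plan.

    Assumes stage time = work / workers**alpha (sublinear scaling, hypothetical).
    """
    jct = cost = 0.0
    for w, n in zip(work, plan):
        t = w / n ** alpha
        jct += t
        cost += n * t * price
    return jct, cost


def candidates(plan):
    """Neighbouring plans: remove one worker from one stage."""
    for i, n in enumerate(plan):
        if n > 1:
            yield plan[:i] + (n - 1,) + plan[i + 1:]


def greedy_plan(start, deadline):
    """Start from a feasible plan and repeatedly take the candidate with the largest M."""
    best = start
    best_jct, best_cost = evaluate(best)
    if best_jct > deadline:
        raise ValueError("starting plan misses the deadline")
    while True:
        scored = []
        for cand in candidates(best):
            jct, cost = evaluate(cand)
            if jct > deadline or cost >= best_cost:
                continue  # infeasible, or no cost saving
            delta_jct = jct - best_jct
            # M = (cost saved) / (JCT added); a free saving ranks first.
            m = float("inf") if delta_jct <= 0 else (best_cost - cost) / delta_jct
            scored.append((m, cand, jct, cost))
        if not scored:
            return best, best_jct, best_cost
        _, best, best_jct, best_cost = max(scored, key=lambda s: s[0])


print(greedy_plan((8, 8, 8), deadline=400.0))
```

The loop stops when no neighbouring plan both saves cost and still meets the deadline, so the result is a local optimum of the greedy search rather than a guaranteed global one.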

    • Question: is it portable to all kinds of HPE algorithms? Accuracy is also important

  • Issue 3: effectively execute allocation plan

    • Rubberband executor: scheduler, cluster manager, placement controller

    • Scheduler requests cluster manager to provision new nodes or de-provision existing ones

    • Placement controller converts the resource quantity allocated to each trial to physical resource assignment

      • Place the parallel workers of a trial onto a single machine (or pack them into a minimal set of nodes)

      • Co-located workers avoid network overhead
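One way to picture the placement step is a simple packing heuristic: put a trial on a single node when it fits, otherwise spread it over as few nodes as possible. This is an illustrative heuristic with made-up names, not the placement algorithm from the paper.

```python
def place_trials(trial_gpus: dict[str, int], node_free: dict[str, int]) -> dict[str, dict[str, int]]:
    """Map each trial to {node: gpus}, preferring a single node per trial.

    trial_gpus: GPUs allocated to each trial; node_free: free GPUs per node.
    """
    free = dict(node_free)
    placement = {}
    # Place the largest trials first so they are most likely to fit on one node.
    for trial, need in sorted(trial_gpus.items(), key=lambda kv: -kv[1]):
        fitting = [n for n, f in free.items() if f >= need]
        if fitting:
            # Best fit: the node with the least free capacity that still fits.
            node = min(fitting, key=lambda n: free[n])
            free[node] -= need
            placement[trial] = {node: need}
            continue
        # No single node fits: take from the emptiest nodes first to span few nodes.
        assignment = {}
        for node in sorted(free, key=lambda n: -free[n]):
            if need == 0:
                break
            take = min(free[node], need)
            if take > 0:
                assignment[node] = take
                need -= take
        if need > 0:
            raise RuntimeError(f"not enough capacity for trial {trial}")
        for node, take in assignment.items():
            free[node] -= take
        placement[trial] = assignment
    return placement


print(place_trials({"a": 4, "b": 2, "c": 2}, {"n1": 4, "n2": 4}))
```

In this example every trial lands on a single node, so no trial pays cross-node communication cost, and both nodes end up fully used.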


  1. Profiler and simulator model the job completion time and cost of potential allocations

  2. Planner generates a low-cost allocation plan that completes on time

  3. Scheduler, placement controller, and cluster manager execute the allocation plan such that worker co-location and cluster utilization are maximized
