Faster and Cheaper Serverless Computing on Harvested Resources

https://dl.acm.org/doi/pdf/10.1145/3477132.3483580

Presentation

  • Serverless computing

    • User's angle

      • Pay per use

      • No worry about VMs

      • Most popular offering - function as a service (FaaS)

        • Bounded completion time with no persistent states between invocations

    • FaaS Provider's angle

      • Provision, manage, and pay IaaS providers for hosting VMs

  • Harvested Resources in Datacenters

    • IaaS providers offer surplus resources as VMs

      • Harvest VM

        • Allocated with a minimum size

        • Grows and shrinks to harvest all unallocated resources

        • Only evicted if minimum size is needed for a regular VM

        • Grace period before eviction (e.g., 30 seconds in Azure)
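The grow/shrink/evict rules above can be sketched with a deliberately simplified single-host model. The function name and its parameters are hypothetical, and the real allocator is certainly more involved; this only restates the three bullets as code.

```python
def harvest_vm_cores(host_cores, regular_cores, min_cores):
    """Cores the Harvest VM holds on one host, or None if it is evicted.

    Assumed model: the Harvest VM takes every core not allocated to
    regular VMs, and is evicted only when regular VMs need part of its
    minimum size.
    """
    unallocated = host_cores - regular_cores
    if unallocated < min_cores:
        return None  # minimum size is needed for a regular VM: evict
    return unallocated  # grows and shrinks with the unallocated cores


print(harvest_vm_cores(32, 8, 2))   # plenty of surplus
print(harvest_vm_cores(32, 30, 2))  # shrunk to its minimum size
print(harvest_vm_cores(32, 31, 2))  # evicted
```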

  • Characterization - Harvest VMs

    • Eviction:

      • >90% of Harvest VMs live longer than 1 day

      • >60% of Harvest VMs live longer than 1 month

      • Harvest VM evictions are generally infrequent

    • Resource variation:

      • >70% of intervals between resource changes are longer than 10 minutes

      • ~35% of intervals between resource changes are longer than 1 hour

      • Resource variation much more frequent than evictions

  • Characterization - FaaS workloads

    • Invocation duration & VM grace period

      • Long invocation - longer than 30s (VM eviction grace period)

      • Long application - at least 1 invocation longer than 30 seconds

      • Short invocations can always terminate in VM grace period

    • 96% invocations shorter than 30 seconds

    • Long invocations take over 82% of the total execution time

    • Application's angle

      • 48.7% applications are long applications

      • Long applications take over 99.7% of the total execution time

      • Nearly half of the applications are long, and long applications take over almost all the execution time
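A minimal sketch of the definitions above, assuming a trace given as (application id, duration in seconds) pairs. The trace format and function names are illustrative, not taken from the paper; only the 30-second threshold comes from the notes.

```python
GRACE_PERIOD_S = 30.0  # Harvest VM eviction grace period from the notes


def is_long_invocation(duration_s, grace_s=GRACE_PERIOD_S):
    """An invocation is long if it runs longer than the grace period."""
    return duration_s > grace_s


def long_applications(trace, grace_s=GRACE_PERIOD_S):
    """Applications with at least one long invocation.

    `trace` is a hypothetical list of (app_id, duration_seconds) pairs.
    """
    return {app for app, d in trace if is_long_invocation(d, grace_s)}


def long_time_share(trace, grace_s=GRACE_PERIOD_S):
    """Fraction of total execution time spent in long applications."""
    long_apps = long_applications(trace, grace_s)
    total = sum(d for _, d in trace)
    long_total = sum(d for app, d in trace if app in long_apps)
    return long_total / total if total else 0.0


trace = [("a", 1.0), ("a", 45.0), ("b", 2.0), ("b", 2.0)]
print(long_applications(trace), long_time_share(trace))
```

Note that a long application contributes all of its invocations to the time share, including its short ones, which is why the per-application share is higher than the per-invocation share.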

  • Take-aways

    • FaaS workloads are a good fit for Harvest VMs

      • Short invocation durations

      • Relatively long harvest VM lifetime

    • Harvest VM resource variation much more common than evictions

    • Long applications take the vast majority of the total execution time

  • Handling Evictions

    • How to eliminate or minimize invocation failure caused by Harvest VM evictions?

      • Use a mix of Harvest VMs and regular VMs

      • Strategies with different tradeoff between reliability and efficiency

        • No failure

          • Guarantee no invocation failure caused by Harvest VM evictions

          • Assign all long applications to regular VMs

          • Efficiency?

            • Fraction of computation hosted by cheap Harvest VMs

            • 12% of computation capacity hosted by Harvest VMs

        • Bounded failure

          • Upper-bounds the per-application eviction failure rate at (100 - x)%

          • Assign an application to regular VMs if its xth-percentile invocation duration is longer than 30s

            • Worst case: all long invocations on Harvest VMs fail --> (100 - x)% failure rate

          • How about efficiency?

            • Failure rate < 1% --> 45.7% computation hosted by Harvest VMs

            • Failure rate < 0.1% --> 28% computation hosted by Harvest VMs

        • Live and Let Die

          • No guarantee on failure rate

          • All applications on Harvest VMs

          • How about reliability?

            • Invocation failure rate

              • Worst: 99.99% success rate

              • 7 nines of reliability

            • Failure requires two low-probability events to happen simultaneously

              • A harvest VM gets evicted while it is running a long invocation

          • Best efficiency and actual low failure rate
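The three placement strategies above can be sketched as one decision function per application. This is an illustration under stated assumptions: the strategy names, the `place` function, and the nearest-rank percentile are choices made here, and the paper may compute percentiles differently.

```python
import math

GRACE_PERIOD_S = 30.0  # Harvest VM eviction grace period from the notes


def percentile(durations, x):
    """Nearest-rank xth percentile (0 < x <= 100) of a non-empty list."""
    ordered = sorted(durations)
    rank = max(1, math.ceil(x * len(ordered) / 100))
    return ordered[rank - 1]


def place(durations, strategy, x=99.0, grace_s=GRACE_PERIOD_S):
    """Return "regular" or "harvest" for one application's history."""
    if strategy == "live_and_let_die":
        return "harvest"  # everything runs on Harvest VMs
    if strategy == "no_failure":
        # Any long invocation sends the whole application to regular VMs.
        return "regular" if max(durations) > grace_s else "harvest"
    if strategy == "bounded_failure":
        # At most (100 - x)% of invocations can exceed the grace period.
        return "regular" if percentile(durations, x) > grace_s else "harvest"
    raise ValueError(f"unknown strategy: {strategy}")


history = [1.0] * 99 + [60.0]  # one long invocation out of 100
for s in ("no_failure", "bounded_failure", "live_and_let_die"):
    print(s, place(history, s))
```

With this history, bounded failure at x = 99 keeps the application on Harvest VMs: only 1 invocation in 100 is long, so even if every long invocation hit an eviction the failure rate would be 1%, which is the (100 - x)% bound.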

  • Handling Resource Variability

    • Join-the-shortest-queue (JSQ) leads to high cold start rate

      • Invocations of the same application distributed across all VMs

      • Inter-arrival time longer than container keep-alive

    • Min-worker-set (MWS)

      • Consolidates each application onto a minimal set of backend VMs

      • Shorter inter-arrival time --> warm starts

      • Consistent hashing to minimize reshuffling of home VMs
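To illustrate why consistent hashing limits reshuffling of home VMs, here is a generic hash-ring sketch. It is not the paper's MWS implementation: the `Ring` class, the virtual-node count, and the rule "worker set = home VM plus the next distinct VMs clockwise" are assumptions made for the example, and it does not model how MWS sizes the set from load.

```python
import bisect
import hashlib


def _hash(key):
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    """Hypothetical consistent-hash ring over backend VM names."""

    def __init__(self, vms, vnodes=64):
        # Each VM owns several points on the ring to even out the load.
        self._points = sorted(
            (_hash(f"{vm}#{i}"), vm) for vm in vms for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._points]

    def worker_set(self, app, size):
        """Home VM first, then the next distinct VMs clockwise."""
        start = bisect.bisect_right(self._keys, _hash(app))
        chosen = []
        for i in range(len(self._points)):
            vm = self._points[(start + i) % len(self._points)][1]
            if vm not in chosen:
                chosen.append(vm)
                if len(chosen) == size:
                    break
        return chosen


vms = ["vm0", "vm1", "vm2", "vm3"]
ring = Ring(vms)
print(ring.worker_set("app-42", 2))
```

Because an application's home VM depends only on the ring points nearest its own hash, removing some other VM (for example after an eviction) leaves that application's home VM unchanged, so its warm containers stay useful.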

  • Implementation on OpenWhisk

    • Harvest Monitor

      • Collects resource information & eviction signal

    • Controller

      • Maintains data from Harvest Monitor

      • Implements MWS

    • Resource monitor

      • Tracks resource variation in the system

      • Spins up new VMs to maintain available resources

Evaluation

Conclusion

  • To host serverless platforms on harvested resources

  • Quantify the challenges of using harvested resources for serverless invocations, including Harvest VM evictions and resource variation

  • Demonstrate the reliability of hosting serverless workloads on harvested resources

  • Demonstrate the performance and economic benefits of hosting serverless platforms on harvested resources
