Faster and Cheaper Serverless Computing on Harvested Resources

https://dl.acm.org/doi/pdf/10.1145/3477132.3483580

Presentation

Serverless computing
- User's angle
  - Pay per use
  - No worry about VMs
  - Most popular offering - function as a service (FaaS)
    Bounded completion time with no persistent states between invocations
- FaaS Provider's angle
  - Provision, manage, and pay IaaS providers for hosting VMs
Harvested Resources in Datacenters
- IaaS providers offer surplus resources as VMs
  - Harvest VM
    Allocated with a minimum size
    Grows and shrinks to harvest all unallocated resources
    Only evicted if minimum size is needed for a regular VM
    Grace period before eviction (e.g., 30 seconds in Azure)
Characterization - Harvest VMs
- Eviction:
  - >90% of Harvest VMs live longer than 1 day
  - >60% of Harvest VMs live longer than 1 month
  - Harvest VM evictions are generally infrequent
- Resource variation:
  - >70% of intervals between resource changes are longer than 10 minutes
  - ~35% of intervals are longer than 1 hour
  - Resource variation much more frequent than evictions
Characterization - FaaS workloads
- Invocation duration & VM grace period
  - Long invocation - longer than 30s (VM eviction grace period)
  - Long application - at least 1 invocation longer than 30 seconds
  - Short invocations can always terminate in VM grace period
- 96% of invocations are shorter than 30 seconds
- Long invocations account for over 82% of the total execution time
- Application's angle
  - 48.7% of applications are long applications
  - Long applications account for over 99.7% of the total execution time
  - Nearly half of the applications are long, and they account for almost all of the execution time
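The long/short definitions above can be made concrete with a small sketch. It assumes a trace of `(app_id, duration_seconds)` pairs; the `summarize` helper and the toy trace are illustrative and not taken from the paper or its dataset.

```python
from collections import defaultdict

GRACE_PERIOD_S = 30.0  # Harvest VM eviction grace period quoted in the notes


def summarize(invocations):
    """Compute long-invocation and long-application shares.

    invocations: iterable of (app_id, duration_seconds) pairs (assumed format).
    """
    invocations = list(invocations)
    by_app = defaultdict(list)
    for app_id, duration in invocations:
        by_app[app_id].append(duration)

    # A long application has at least one invocation longer than the grace period.
    long_apps = {app for app, ds in by_app.items() if max(ds) > GRACE_PERIOD_S}

    total_time = sum(d for _, d in invocations)
    long_durations = [d for _, d in invocations if d > GRACE_PERIOD_S]
    long_app_time = sum(d for app, d in invocations if app in long_apps)

    return {
        "long_invocation_fraction": len(long_durations) / len(invocations),
        "long_invocation_time_share": sum(long_durations) / total_time,
        "long_app_fraction": len(long_apps) / len(by_app),
        "long_app_time_share": long_app_time / total_time,
    }


# Toy trace with made-up numbers, only to exercise the definitions.
trace = [("a", 1.0), ("a", 2.0), ("a", 97.0), ("b", 0.5), ("b", 0.5), ("c", 4.0)]
stats = summarize(trace)
```

Even in this toy trace, a single long invocation dominates the total execution time, which mirrors the skew the characterization reports.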
Take-aways
- FaaS workloads are a good fit for Harvest VMs
  - Short invocation durations
  - Relatively long harvest VM lifetime
- Harvest VM resource variation much more common than evictions
- Long applications take the vast majority of the total execution time
Handling Evictions
- How to eliminate or minimize invocation failure caused by Harvest VM evictions?
  - Use a mix of Harvest VMs and regular VMs
  - Strategies with different tradeoffs between reliability and efficiency
    No failure
      Guarantees no invocation failure caused by Harvest VM evictions
      Assigns all long applications to regular VMs
      Efficiency (fraction of computation hosted by cheap Harvest VMs): 12% of computation capacity hosted by Harvest VMs
    Bounded failure
      Upper-bounds the per-application eviction failure rate at (100 - x)%
      Assigns an application to regular VMs if its xth duration percentile is longer than 30s
      Worst case: all long invocations on Harvest VMs fail --> (100 - x)% failure rate
      Efficiency:
        Failure rate < 1% --> 45.7% of computation hosted by Harvest VMs
        Failure rate < 0.1% --> 28% of computation hosted by Harvest VMs
    Live and Let Die
      No guarantee on failure rate
      All applications on Harvest VMs
      Reliability (invocation failure rate):
        Worst: 99.99% success rate
        7 nines of reliability
      A failure requires two low-probability events to happen simultaneously: a Harvest VM gets evicted while it is running a long invocation
      Best efficiency and a low failure rate in practice
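The three placement strategies can be sketched as a single decision function over an application's observed invocation durations. This is a minimal illustration, not the paper's implementation: the `place` and `percentile` helpers, the nearest-rank percentile definition, and the strategy names as strings are all assumptions made for the example.

```python
import math

GRACE_PERIOD_S = 30.0  # Harvest VM eviction grace period quoted in the notes


def percentile(durations, x):
    """Nearest-rank xth percentile of a non-empty list, for 0 < x <= 100."""
    ordered = sorted(durations)
    rank = max(1, math.ceil(x * len(ordered) / 100))
    return ordered[rank - 1]


def place(durations, strategy, x=99.0):
    """Return 'harvest' or 'regular' for one application's duration history."""
    if strategy == "no_failure":
        # Any long invocation sends the whole application to regular VMs.
        return "regular" if max(durations) > GRACE_PERIOD_S else "harvest"
    if strategy == "bounded_failure":
        # At most (100 - x)% of invocations can outlive the grace period.
        return "regular" if percentile(durations, x) > GRACE_PERIOD_S else "harvest"
    if strategy == "live_and_let_die":
        # Everything runs on Harvest VMs; failures are tolerated.
        return "harvest"
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical application: 99 short invocations and one 60-second invocation.
history = [1.0] * 99 + [60.0]
decisions = {
    "no_failure": place(history, "no_failure"),
    "bounded_99": place(history, "bounded_failure", x=99.0),
    "bounded_99_9": place(history, "bounded_failure", x=99.9),
    "live_and_let_die": place(history, "live_and_let_die"),
}
```

With this history, raising `x` from 99 to 99.9 moves the application from Harvest VMs to regular VMs, which is the reliability-versus-efficiency tradeoff the bounded-failure numbers describe.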
Handling Resource Variability
- Join-the-shortest-queue (JSQ) leads to high cold start rate
  - Invocations of the same application distributed across all VMs
  - Inter-arrival time longer than container keep-alive
- Min-worker-set (MWS)
  - Consolidates each application onto a minimal set of backend VMs
  - Shorter inter-arrival time --> warm starts
  - Consistent hashing to minimize reshuffling of home VMs
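One way to picture MWS with consistent hashing is a hash ring on which each application walks clockwise from its own hash and takes the first few distinct VMs as its worker set, with the first one acting as its home VM. The sketch below is an assumption-laden illustration: the `Ring` class, the virtual-node count, and the way the worker-set size is passed in are hypothetical and do not come from the paper or from OpenWhisk.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    """Consistent-hash ring over backend VMs, using virtual nodes."""

    def __init__(self, vms, vnodes=64):
        self._points = sorted(
            (_hash(f"{vm}#{i}"), vm) for vm in vms for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def worker_set(self, app_id, size):
        """Return up to `size` distinct VMs clockwise from the app's hash.

        The first entry plays the role of the application's home VM.
        """
        start = bisect.bisect_right(self._keys, _hash(app_id))
        chosen = []
        for i in range(len(self._points)):
            vm = self._points[(start + i) % len(self._points)][1]
            if vm not in chosen:
                chosen.append(vm)
                if len(chosen) == size:
                    break
        return chosen


vms = [f"vm{i}" for i in range(5)]
ring = Ring(vms)
small = ring.worker_set("app-42", 2)  # low load: two backends
large = ring.worker_set("app-42", 3)  # higher load: grow the set by one
```

Growing the set only appends VMs, so existing warm containers stay useful, and removing a VM outside an application's worker set leaves that set unchanged, which is the reshuffling property the notes attribute to consistent hashing.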
Implementation on OpenWhisk
- Harvest Monitor
  - Collects resource information & eviction signal
- Controller
  - Maintains data from Harvest Monitor
  - Implements MWS
- Resource monitor
  - Tracks resource variation in the system
  - Spins up new VMs to maintain available resources

Evaluation

Conclusion

To host serverless platforms on harvested resources, the paper makes three contributions:
- Quantify the challenges of using harvested resources for serverless invocations, including Harvest VM evictions and resource variation
- Demonstrate the reliability of hosting serverless workloads on harvested resources
- Demonstrate the performance and economic benefits of hosting serverless platforms on harvested resources
