MOSAIC: Processing a Trillion-Edge Graph on a Single Machine

https://dl.acm.org/doi/pdf/10.1145/3064176.3064191

  • Limited by disk size, not by memory

  • Out-of-core

  • Trillion-edge graph (8TB overall)

  • Large-scale graph processing is ubiquitous

    • Social networks, genome analysis, machine learning over graphs

  • Powerful, heterogeneous machines

    • Terabytes of RAM on multiple sockets

    • Powerful many-core coprocessors (Xeon Phi processor)

      • Specific to the Xeon Phi, or applicable to GPUs as well?

    • Fast, large-capacity non-volatile memory

    • Goal: take advantage of heterogeneous machines to process tera-scale graphs

Graph Processing: Sample Application

  • Community Detection

  • Find Common Friends

  • Find Shortest Paths

  • Estimate Impact of Vertices (webpages, users, ...)

  • ...

Design

  • Design space

    • Single Machine

      • Out-of-core: cheap, but potentially slow

      • In memory: fast, but limited graph size

    • Cluster

      • Out-of-core: large graphs, but expensive & slow

      • In memory: large graphs, fast, but expensive

  • Single machine, out-of-core: cost-effective

    • 10x? How much more expensive is the in-memory alternative? What is the ratio? Any evaluation?

    • Validate this

  • Goal: good performance and large graphs

  • Goal: Run algorithms on large graphs on a single machine using coprocessors

    • Enabled by

      • Common, familiar API (vertex / edge-centric); see the sketch after this list

      • Encoding: Lossless compression

      • Cache locality

      • Processing on isolated subgraphs

        • Divided into small subgraphs

        • Process them massively in parallel using cores and hyper-threads
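
A minimal sketch of what the vertex/edge-centric API could look like for PageRank-style impact estimation. The function names (pull, reduce, apply) and signatures are my own illustration of the idea, not MOSAIC's exact interface:

```cpp
#include <cstdint>
#include <vector>

// Illustrative edge-centric API in the spirit of MOSAIC's programming model.
// Names and signatures are assumptions for this sketch, not the paper's API.
struct Edge { uint32_t src; uint32_t dst; };

struct PageRank {
    // pull: run per edge inside a tile; reads the source vertex state and
    // produces a contribution for the target vertex.
    static float pull(const Edge& e, const std::vector<float>& rank,
                      const std::vector<uint32_t>& out_degree) {
        return rank[e.src] / out_degree[e.src];
    }
    // reduce: combines two partial contributions for the same target vertex
    // (commutative and associative, so tiles can be processed in parallel).
    static float reduce(float a, float b) { return a + b; }
    // apply: run once per vertex after all tiles have been processed.
    static float apply(float accumulated, float damping = 0.85f,
                       uint64_t num_vertices = 1) {
        return (1.0f - damping) / num_vertices + damping * accumulated;
    }
};

// One iteration over a single tile. The engine would stream tiles from fast
// storage and run this loop on many coprocessor threads in parallel.
// next_rank is assumed to be zero-initialized at the start of the iteration.
void process_tile(const std::vector<Edge>& tile_edges,
                  const std::vector<float>& rank,
                  const std::vector<uint32_t>& out_degree,
                  std::vector<float>& next_rank) {
    for (const Edge& e : tile_edges) {
        float contrib = PageRank::pull(e, rank, out_degree);
        next_rank[e.dst] = PageRank::reduce(next_rank[e.dst], contrib);
    }
}
```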

  • Xeon Phi coprocessors handle edge processing on tiles; the host Xeon CPUs handle global vertex state

    • 8 GB of high bandwidth memory

  • Access pattern: tiles are streamed sequentially from fast storage (NVMe)

  • Subgraphs: tiles

    • At most 2^16 vertices

    • Local identifiers: 2 bytes each, plus a level of indirection (a per-tile table mapping local IDs back to global vertex IDs); sketched below
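
A minimal sketch of a tile with 2-byte local identifiers and an indirection table back to global vertex IDs. The struct layout and field names are assumptions; only the sizes follow the notes above:

```cpp
#include <cstdint>
#include <vector>

// Sketch of a tile with at most 2^16 vertices. Field names are illustrative.
struct LocalEdge {
    uint16_t src;   // 2-byte local identifier
    uint16_t dst;   // 2-byte local identifier
};

struct Tile {
    // Level of indirection: local ID (the index) -> global vertex ID.
    // At most 2^16 entries, so the table itself stays small.
    std::vector<uint64_t> local_to_global;
    // Edges stored with compact 2-byte endpoints instead of 8-byte global
    // IDs, roughly a 4x reduction in per-edge storage.
    std::vector<LocalEdge> edges;

    uint64_t global_src(const LocalEdge& e) const {
        return local_to_global[e.src];
    }
    uint64_t global_dst(const LocalEdge& e) const {
        return local_to_global[e.dst];
    }
};
```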

  • Cache locality

    • Inside subgraphs: sort by access order

    • Between subgraphs: consecutive tiles have overlapping vertex sets (Hilbert-curve tile order; sketch below)
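
The paper orders tiles along a Hilbert curve over the adjacency matrix, so consecutive tiles touch overlapping vertex ranges. Below is a sketch using the standard Hilbert index-to-coordinates conversion; the code is generic, not taken from MOSAIC:

```cpp
#include <cstdint>
#include <utility>

// Standard Hilbert-curve d2xy conversion: maps a position d along the curve
// to (x, y) block coordinates in an n x n grid (n a power of two).
// Consecutive d values map to adjacent blocks of the adjacency matrix, so
// consecutive tiles share rows/columns, i.e. overlapping vertex sets.
static void hilbert_rot(uint64_t n, uint64_t& x, uint64_t& y,
                        uint64_t rx, uint64_t ry) {
    if (ry == 0) {
        if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
        std::swap(x, y);
    }
}

std::pair<uint64_t, uint64_t> hilbert_d2xy(uint64_t n, uint64_t d) {
    uint64_t x = 0, y = 0;
    for (uint64_t s = 1; s < n; s *= 2) {
        uint64_t rx = 1 & (d / 2);
        uint64_t ry = 1 & (d ^ rx);
        hilbert_rot(s, x, y, rx, ry);
        x += s * rx;
        y += s * ry;
        d /= 4;
    }
    return {x, y};
}

// Processing tiles in order d = 0, 1, 2, ... visits (row, column) blocks of
// the adjacency matrix along the Hilbert curve, so the vertex state cached
// for one tile is largely reused by the next.
```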

  • Produce the tiles in a separate preprocessing step (rough sketch below)
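
Rough sketch of the preprocessing loop that produces tiles, reusing the Tile/LocalEdge types from the tile sketch above. The actual preprocessing in the paper is parallel and Hilbert-ordered; this only illustrates translating global IDs to 2-byte local IDs and closing a tile once it would exceed 2^16 distinct vertices:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Assumes the Tile and LocalEdge types from the earlier sketch are in scope.
struct GlobalEdge { uint64_t src, dst; };

std::vector<Tile> build_tiles(const std::vector<GlobalEdge>& edges) {
    std::vector<Tile> tiles(1);
    std::unordered_map<uint64_t, uint16_t> global_to_local;

    // Translate a global vertex ID to a local ID within the current tile,
    // allocating a new local ID (and indirection entry) if needed.
    auto local_id = [&](uint64_t v) -> uint16_t {
        auto it = global_to_local.find(v);
        if (it != global_to_local.end()) return it->second;
        uint16_t id = static_cast<uint16_t>(tiles.back().local_to_global.size());
        global_to_local.emplace(v, id);
        tiles.back().local_to_global.push_back(v);
        return id;
    };

    for (const GlobalEdge& e : edges) {
        // Number of new vertices this edge would add to the current tile.
        size_t new_vertices = !global_to_local.count(e.src)
                            + !global_to_local.count(e.dst);
        if (tiles.back().local_to_global.size() + new_vertices > (1u << 16)) {
            tiles.emplace_back();        // close the tile, start a new one
            global_to_local.clear();
        }
        tiles.back().edges.push_back({local_id(e.src), local_id(e.dst)});
    }
    return tiles;
}
```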

Evaluation

  • Graph updates require re-running the preprocessing step

  • Can tiles be generated incrementally?

    • No

  • Multiple combines are not supported

  • Execution model (how are tiles placed? is there inter-core communication, or are tiles processed independently?)

Conclusion

  • Would the same gains apply on multi-core CPUs alone (without coprocessors)?

    • Use the principle

    • Spread computation

    • Hybrid systems?

      • Take advantage of tiling

      • Along with offloading to GPUs

Think about:

  • Workloads, technologies, trends. Design something that leverages these trends for the problem at hand.
