Distributed graph analysis has always been hard: to look up the links of a neighboring node, you often have to move data between machines, which increases execution time and puts extra load on the network.
In this talk, we'll discuss several tricks we use at OK for working with big graphs, in particular the friendship graph, which contains more than 13 billion links. Exploiting graph symmetry can speed up processing considerably, and probabilistic data structures can take you even further.
Apache Spark is the main platform for distributed analytics at OK, graph analytics included, so the talk will be illustrated with code examples for Spark.
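As a rough illustration of the two ideas mentioned above, here is a minimal plain-Python sketch. It is not the Spark code from the talk, and it is not a description of OK's pipeline. The names `common_friend_counts` and `BloomFilter` are hypothetical, as are the filter size and hash count. The first function uses symmetry when counting common friends: the pairs (a, b) and (b, a) always share the same count, so only pairs with a < b are produced. The second part is a small Bloom filter, one example of a probabilistic data structure: a compact summary of a friend list that never misses a real friend but may occasionally report a false positive.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def common_friend_counts(edges):
    """Count common friends for every pair of users, visiting each pair once.

    `edges` is assumed to list each undirected friendship once, as (u, v).
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    counts = defaultdict(int)
    for friends in adjacency.values():
        # Any two friends of the same user have that user in common.
        # Symmetry: emit only the ordered pair a < b, never (b, a) as well.
        for a, b in combinations(sorted(friends), 2):
            counts[(a, b)] += 1
    return dict(counts)


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits      # illustrative size, not a tuned value
        self.num_hashes = num_hashes  # illustrative hash count
        self.bits = 0                 # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                f"{seed}:{item}".encode(), digest_size=8
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


if __name__ == "__main__":
    edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
    print(common_friend_counts(edges))

    friends_of_3 = BloomFilter()
    for friend in (1, 2, 4):
        friends_of_3.add(friend)
    print(2 in friends_of_3)  # True: an added item is always found
```

In a distributed setting, the appeal of a structure like this is that a fixed-size bit array can be shipped between machines instead of a full friend list, at the cost of approximate answers.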
This talk will be useful for data scientists and data engineers.