Linear Work Generation of R-MAT Graphs

R-MAT is a simple, widely used recursive model for generating "complex network" graphs with a power-law degree distribution and community structure. We make R-MAT even more useful by reducing the required work per edge from logarithmic to constant. The algorithm works in an embarrassingly parallel way.


Introduction
Algorithms for processing large graphs have attracted a lot of interest in recent years since many "big data" problems can be described with this abstraction. However, one obstacle to developing these algorithms is obtaining sufficiently large inputs. Today, the largest real world networks, e.g., the Facebook graph, are not available to most researchers. When we want to anticipate future developments, no real world inputs are available at all. Hence, scalable graph generators are an important surrogate; see, e.g., [1,2,3]. A flagship model with respect to large graphs is the R-MAT model [1]. R-MAT is simple, models power law degree distributions, and also has a community structure that is somewhat similar to complex real world networks. R-MAT is therefore used in the Graph 500 benchmark [4] that is used to evaluate graph algorithms on the largest massively parallel machines.
The simplest variant of R-MAT defines a directed multigraph G = (V, E) with n = 2^k nodes and m edges based on four parameters a, b, c, and d with a + b + c + d = 1. Each edge is drawn independently at random using the following recursive process: the adjacency matrix is split into four quadrants. The edge is placed in the upper left quadrant with probability a, in the upper right quadrant with probability b, in the lower left one with probability c, and in the lower right one with probability d. The process repeats this subdivision k times until a single entry of the adjacency matrix is determined. In other words, a for-loop generates one bit of the row and column indices for the edge in each iteration. Our main result, described in Section 2, is to explain how a logarithmic number of address bits can be generated in each iteration without changing the underlying process. Section 4 reports experiments which show that the new process is an order of magnitude faster than the most widely used R-MAT generator [4]. Refer to Section 3 for generalizations to undirected graphs, more general stochastic Kronecker graphs, bipartite graphs, simple graphs, smoothed degree distributions, massively parallel implementation, partitioned graphs, and adaptations to GPUs.
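As an illustration, this baseline process can be sketched as follows (Python for readability; our implementation is in C++, and the function name is ours). Each iteration fixes one bit of each index, so the work per edge is Θ(k) = Θ(log n):

```python
import random

def rmat_edge(k, a, b, c, d):
    """Draw one R-MAT edge in a 2^k x 2^k adjacency matrix.

    Each of the k iterations chooses one quadrant and thereby fixes one
    bit of the row index i and one bit of the column index j.
    """
    i = j = 0
    for _ in range(k):
        r = random.random()
        i <<= 1
        j <<= 1
        if r < a:                # upper left quadrant: bits (0, 0)
            pass
        elif r < a + b:          # upper right quadrant: bits (0, 1)
            j |= 1
        elif r < a + b + c:      # lower left quadrant: bits (1, 0)
            i |= 1
        else:                    # lower right quadrant: bits (1, 1)
            i |= 1
            j |= 1
    return i, j
```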

Our Algorithm
We exploit the fact that one can sample in constant time from any probability distribution on a finite set, using a data structure A that can be built in time proportional to the number of elements in the set [5]. The basic idea is to precompute n^α paths of the recursive process described in Section 1, together with their probabilities, for some constant α < 1. By sampling from A and concatenating the resulting paths, we generate Θ(log n) address bits of adjacency array entries in each iteration. The simplest such strategy generates all paths of a fixed length ℓ < 0.5 log n.
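A sketch of this fixed-length strategy (Python for illustration; function names are ours, and random.choices stands in for the constant-time alias-table sampling of [5]):

```python
import itertools
import random

def build_table(ell, a, b, c, d):
    """Enumerate all 4^ell quadrant paths of length ell.

    Returns (entries, weights): entries[t] = (row bits, column bits) of
    path t, weights[t] = its probability.  An alias table over these
    entries gives constant-time sampling [5]; random.choices stands in
    for it below.
    """
    quadrant_prob = {(0, 0): a, (0, 1): b, (1, 0): c, (1, 1): d}
    entries, weights = [], []
    for path in itertools.product(quadrant_prob, repeat=ell):
        i = j = 0
        p = 1.0
        for ib, jb in path:
            i = (i << 1) | ib
            j = (j << 1) | jb
            p *= quadrant_prob[(ib, jb)]
        entries.append((i, j))
        weights.append(p)
    return entries, weights

def rmat_edge_fixed(k, ell, entries, weights):
    """Draw one edge by concatenating k/ell precomputed fragments
    (assumes ell divides k)."""
    i = j = 0
    for _ in range(k // ell):
        fi, fj = random.choices(entries, weights)[0]
        i = (i << ell) | fi
        j = (j << ell) | fj
    return i, j
```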
We can increase the average number of generated bits by choosing paths of similar probability. For example, for a given bound n on the table size, the following code fragment produces a set of entries Q that maximizes the minimum probability of its entries:

    Q := {(a, "0", "0"), (b, "0", "1"), (c, "1", "0"), (d, "1", "1")}
    while |Q| + 3 ≤ n do
        (p, i, j) := an entry of Q with maximum probability; remove it from Q
        Q.insert(ap, i · 0, j · 0); Q.insert(bp, i · 0, j · 1)
        Q.insert(cp, i · 1, j · 0); Q.insert(dp, i · 1, j · 1)    -- "·" stands for string concatenation

This pseudocode uses strings of bits for clarity where a real implementation processes machine words in a bit-parallel fashion in constant time.
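This greedy construction of Q can be sketched as follows (Python for illustration; the function name is ours, and heapq with negated probabilities emulates the max-priority queue):

```python
import heapq

def build_variable_table(n, a, b, c, d):
    """Greedy construction of the entry set Q, bounded by table size n.

    Repeatedly splitting a most probable entry into its four children
    maximizes the minimum entry probability, so all entries end up with
    similar probability.  heapq is a min-heap, so probabilities are
    negated; an entry is (probability, row bits, column bits, length).
    """
    heap = [(-a, 0, 0, 1), (-b, 0, 1, 1), (-c, 1, 0, 1), (-d, 1, 1, 1)]
    heapq.heapify(heap)
    while len(heap) + 3 <= n:          # each split removes 1 entry, adds 4
        p, i, j, l = heapq.heappop(heap)
        for q, ib, jb in ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1)):
            heapq.heappush(heap, (p * q, (i << 1) | ib, (j << 1) | jb, l + 1))
    return [(-p, i, j, l) for p, i, j, l in heap]
```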
The resulting set of entries Q is then preprocessed into a data structure A allowing constant time sampling. Row and column indices of adjacency array entries can then be generated as follows:

    i := j := ε    -- fragments of row and column index bit strings
    for (e := 1; e ≤ B; e++)    -- generate a block of B edges
        while |i| < k do
            (i', j') := sample from A
            i := i · i'; j := j · j'
        output edge (i[1..k], j[1..k])
        i := i[k+1..]; j := j[k+1..]    -- reuse remaining bits for next edge

We use an information-theoretic argument to understand why generating variable rather than fixed length fragments of row/column indices can help. Sampling from A yields ≥ log(1/p) bits of information, where p is the maximum probability of an entry in A; this table will have size ≈ 1/p. Sampling a single bit of row and column index yields H = −a log a − b log b − c log c − d log d ≤ 2 bits of information, the entropy of the four-letter alphabet with probabilities a, b, c, d. A table of size |A| = 1/p that stores equal length fragments of row/column indices can store fragments of size log(1/p)/2 per index. This yields H · log(1/p)/2 bits of information. Taking the ratio of these two numbers, we get the possible speedup through using variable length fragments: log(1/p) / (H · log(1/p)/2) = 2/H ≥ 1.
For example, the Graph 500 benchmark uses a = 0.57, b = c = 0.19, and d = 0.05. This distribution has entropy H ≈ 1.59, i.e., the method with variable length entries for A needs about a factor of 1.26 fewer samples from A. Higher gains are possible for more skewed parameters a-d.
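These figures are easy to verify; a minimal check (Python) of the entropy and the resulting 2/H speedup bound:

```python
import math

def entropy(ps):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in ps)

# Graph 500 parameters: entropy H ~1.59 bits, speedup bound 2/H ~1.26.
H = entropy([0.57, 0.19, 0.19, 0.05])
# More skewed parameters lower the entropy and raise the bound, e.g.
# a = 0.9, b = c = 0.025, d = 0.05 gives H ~0.62 and 2/H ~3.2.
```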

Generalizations
Several generalizations are quite easy. We can generate an undirected graph by mirroring edges in the lower left triangular matrix to the upper right triangular matrix. The Graph 500 benchmark scrambles row and column IDs in order to "hide" the structure of the graph. This is a simple constant time postprocessing. More general stochastic Kronecker graphs [6] can be generated by using a k × k matrix rather than a 2 × 2 matrix for the recursion. We can generate bipartite graphs, remove duplicate edges as proposed in [1], or avoid them altogether as described below for generating partitioned graphs.
Chakrabarti [1] proposes to smooth the degree distribution of the graph by perturbing the parameters a-d in each step of the process. Note that this step is often ignored, for example in the Graph 500 benchmark. Implementing this exact approach in constant time per edge seems difficult. However, we can perturb the probabilities for the entries of A, or use several such perturbed tables, to achieve a similar effect.
Parallelization. Edge generation in our algorithm is embarrassingly parallel and communication-free. Here we note that one could also achieve a classical goal of parallel algorithm theory: not only linear work but also logarithmic time on the critical path (span) for an unbounded number of processors. This can be achieved by generating the entries of A by recursively expanding all entries with probability > p for an appropriately chosen value of p. This can be done with span log_{1/max{a,b,c,d}}(1/p) and work linear in |A|. We can then use the parallel algorithm for table construction from [7], which has span O(log |A|) and linear work.
More practically relevant is that parallel graph algorithms work best if each PE is responsible for all edges incident to a given subset of the nodes. Sorting the edges accordingly after they were generated would destroy the communication-free property of our generator. We can get a communication-free generator for such partitioned graphs using an approach similar to the one used for Erdős–Rényi graphs in [3]. The adjacency matrix is split into a number of square tiles. We use a divide-and-conquer algorithm that determines the number of edges generated in each of these tiles. Each PE only recurses on those subproblems that intersect the rows of the adjacency matrix assigned to it. The base case is a single tile of the adjacency matrix, where we use the algorithm from Section 2 to generate the prescribed number of entries. Note that this makes duplicate elimination a local operation, so that this feature can also be implemented in a communication-free way.
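The divide-and-conquer step can be sketched as follows (Python for illustration; function and parameter names are ours, the multinomial split is simulated naively where a scalable implementation would draw binomial variates directly as in [3], and each PE would recurse only into quadrants intersecting its assigned rows):

```python
import random

def split_edges(m, size, tile, a, b, c, d, r0=0, c0=0, out=None):
    """Distribute m edges of a size x size R-MAT adjacency matrix over
    square tiles of side `tile` (both powers of two, tile <= size).

    The edge counts of the four quadrants follow a multinomial
    distribution with probabilities (a, b, c, d); here it is simulated
    with m coin flips for brevity.  Returns a dict mapping tile origin
    (row offset, column offset) -> number of edges in that tile.
    """
    if out is None:
        out = {}
    if size == tile:
        out[(r0, c0)] = out.get((r0, c0), 0) + m
        return out
    counts = [0, 0, 0, 0]
    for _ in range(m):   # naive multinomial split of m over the quadrants
        r = random.random()
        counts[0 if r < a else 1 if r < a + b else 2 if r < a + b + c else 3] += 1
    h = size // 2
    for q, (dr, dc) in enumerate(((0, 0), (0, h), (h, 0), (h, h))):
        split_edges(counts[q], h, tile, a, b, c, d, r0 + dr, c0 + dc, out)
    return out
```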
Finally, the algorithms are sufficiently simple to be easy to port to GPUs. Perhaps the single sampling loop from Section 2 would then be split into separate loops for generating random numbers, retrieving bit strings from A, and chopping the resulting long bit strings into edges.

Experiments
We perform experiments on a machine with two Intel Xeon E5-2683 v4 processors with 16 cores each; with hyperthreading, this means we can use up to 64 threads. Our implementation is written in C++ and compiled with the GNU C++ compiler g++ in version 8.2.0. The source code is available at https://github.com/lorenzhs/wrs/tree/rmat. We compare with the R-MAT generators used in the Graph 500 benchmark reference implementation and in NetworKit [8]. We removed the code for making the graph undirected and for scrambling vertex IDs in order to concentrate on the core task of R-MAT: edge generation. The modified code is also available under the above link. Figure 1 shows the throughput using 64 threads as a function of the available table size. For n = 2^30 nodes, we generate 10^11 edges overall, assigning blocks of 2^16 edges at a time to the threads. We see two peaks in throughput, which correspond to fully using the L2 cache and the L3 cache, respectively. Overall, on the Intel machine, the best throughput is up to 881 million edges per second. This is 14.2 times faster than the Graph 500 generator and 7.9 times faster than NetworKit. Adding the "clip-and-flip" technique to produce undirected graphs adds a constant cost of 5-8 ns per edge. Scrambling (to prevent node IDs from giving away information about their likely properties) adds another 40 ns per edge using our optimised implementation, which is faster than in Graph 500. We can see that the variable length variant performs slightly worse than the fixed length variant since there is some overhead for processing variable length bit strings.
For more skewed parameters, the variable length variant comes out ahead. Setting a = 0.9, b = c = 0.025, and d = 0.05, the variable length variant achieves 1.23 · 10^9 edges per second with a table size of 2^13 entries. This is roughly 40 % more than the fixed depth variant. Note that for this choice of parameters, utilising the full L3 cache is no longer optimal: as the entropy of the alphabet is now much lower, the 2^13 entries that fit into each core's L2 cache provide 13.9 bits of information on average. Increasing the table size to 2^21 entries yields 15 bits per sample, which is not enough of an improvement to offset the increased latency of the L3 cache. Both of these figures compare to exactly 10 bits per sample for the best-performing fixed depth configuration, whose running time remains unchanged (it does not depend on the parameters a-d). Figure 2 gives the speedup over our algorithm on a single thread. We achieve a speedup of 30.6 on 32 threads and 32 on 64 threads. Hyperthreading is of little help because the memory bandwidth to the L3 cache becomes a limiting factor in our code. In contrast, the Graph 500 generator benefits significantly from hyperthreading since its arithmetic operations far dominate its memory accesses.
The speed of our generators compares favorably to generators for other models. For example, the sequential Erdős–Rényi generator of [3] is only around 10 % faster than our sequential generator, but for a much simpler model (the special case a = b = c = d). On the other hand, the streaming hyperbolic graph generator sRHG of [3], a more complicated model, takes around 5 times more time per edge.

Conclusions
We give a simple and practical algorithm for generating R-MAT graphs with constant work per edge in an embarrassingly parallel way. This makes this widely used family of graphs even more easily usable for experiments with huge graphs. We improve sequential throughput by an order of magnitude over the basic algorithm by precomputing chains of many random decisions and storing them in an appropriate data structure for sampling from a discrete distribution. The same approach is likely to work in other settings too.