Reliability of relational event model estimates under sampling: how to fit a relational event model to 360 million dyadic events

We assess the reliability of relational event model parameters estimated under two sampling schemes: (1) uniform sampling from the observed events and (2) case-control sampling, which samples non-events, or null dyads ("controls"), from a suitably defined risk set. We experimentally determine the variability of estimated parameters as a function of the number of sampled events and controls per event, respectively. Results suggest that relational event models can be reliably fitted to networks with more than 12 million nodes connected by more than 360 million dyadic events by analyzing a sample of some tens of thousands of events and a small number of controls per event. Using data that we collected on the Wikipedia editing network, we illustrate how network effects commonly included in empirical studies based on relational event models need widely different sample sizes to be estimated reliably. For our analysis we use open-source software that implements the two sampling schemes, allowing analysts to fit relational event models to the same or other data that may be collected in different empirical settings, varying sample parameters or model specification.


Introduction
Relational event networks arise naturally from directed social interaction. Examples include communication networks (Monge and Contractor 2003), teams (Leenders et al. 2016), online peer-production (Lerner and Tirole 2001;Lerner and Lomi 2017), international relations (Lerner et al. 2013), and discourse networks (Leifeld 2016;Brandenberger 2019). In all these cases, social interaction reflects, and at the same time shapes, relations among social actors and affects individual sentiments -such as trust, esteem, like, or dislike -and their contextual behavioral expressions (Stadtfeld and Block 2017).
Relational events resulting from social interaction in open populations are often more directly and accurately accessible when interaction is mediated by technology, rather than embedded in face-to-face social relations. Data on relational event networks are increasingly accessible through social media platforms such as Twitter (Dodds et al. 2011) or Facebook (Golder et al. 2007). Advances in automatic data collection and coding strategies recording interaction at fine time-granularity are also encouraging the diffusion of relational event models (Lazer et al. 2009). Relational events have been suggested as the appropriate unit of analysis, for instance, in the study of team processes (Leenders et al. 2016), and interorganizational relations (Vu et al. 2017), where aggregating interaction over more or less arbitrary time intervals continues to represent the dominant research design (Pallotti and Lomi 2011).
Building on the path-breaking work of Butts (2008), empirical studies adopting relational event models (REM) are becoming more common (Vu et al. 2017). Practical applicability of REM, however, is limited by the computational cost of parameter estimation, so that empirical studies typically restrict the analysis to smaller sub-networks, as in the study of Wikipedia articles about migration-related topics by Lerner and Lomi (2019a). This strategy raises two problems. First, it is unclear whether the chosen sub-network is representative for a larger network. In the given example it seems rather plausible that articles about migration-related topics are not representative for all Wikipedia articles. Second, if we restrict the analysis to sub-networks, the computation of explanatory variables is potentially incorrect. In the example from Lerner and Lomi (2019a), Wikipedia users might contribute to articles that are not in the chosen subset so that, for instance, the out-degree, in-degree, or four-cycle statistics, see Eqs. (8) to (10), might be computed incorrectly. In brief, restricting the analysis to sub-networks needs a strong assumption about the modularity or quasi-decomposability of the relational system of interest (Newman 2006; Simon 1996, 1977). In this paper we follow a different analytical strategy: even though we compute explanatory variables only for a random subset of events, all events are aggregated into the network of past events (see Sect. 3.3), which ensures that all computed explanatory variables are correct.
We contribute to the practical applicability of REM in empirical network research by:

1. Providing a proof-of-concept that relational event models can be fitted to large event networks with millions of nodes and hundreds of millions of events by analyzing a sample of some tens of thousands of events and a small number of controls per event. For some network effects it is even sufficient to analyze a sample comprising a few hundred events.

2. Assessing the variation of estimated parameters as a function of the number of sampled events and the number of controls per event, separately for different commonly used network effects. This allows us to determine how many events and controls are needed for reliable estimation of the different effects and sheds light on the question whether we should sample few events and many controls per event, or rather many events and fewer controls per event.

3. Identifying characteristics in the data that explain why some network effects need many more observations than others to be reliably estimated, and discussing a different sampling strategy that might be applied in situations of badly distributed explanatory variables.

4. Introducing an open-source software with which similar parameter estimation and sensitivity tests can be performed using different sample parameters, different models, or different data.
The remainder of this paper is structured as follows. Section 2 recalls the basic framework for relational event models and introduces the two sampling strategies. In Sect. 3 we introduce the setting, data, empirical model specification, and the experimental design. We provide detailed results in Sect. 4 and summarize their implications in Sect. 5 where we also offer recommendations for the practice of estimating REMs and refer to an open-source software that may be adopted to replicate the results of the study in the same or different settings. We conclude with a general discussion of the implications of our work for the practice of estimating relational event models in empirical studies. The majority of the numerical results are included in an appendix.

Relational event models
A general statistical modeling framework capable of exploiting the fine-grained time information inherent in relational event sequences has been proposed by Butts (2008). Relational event models (REM) specify time-varying event rates for all dyads as a function of past events on the same or other surrounding dyads. Variations of this model framework include REM for weighted events (Brandes et al. 2009;Lerner and Lomi 2017), models based on marked point-processes (Amati et al. 2018), and actor-oriented REMs (Stadtfeld and Block 2017).
In general, relational event models (Butts 2008) specify a likelihood for sequences of relational events E = (e_1, ..., e_N), where each event e_i has the form e_i = (u_i, v_i, t_i).

In the notation above, t_i is the time of the event, u_i ∈ U_{t_i} is the source, sender, or initiator of the event, taken from a set U_{t_i} of possible source nodes at the event time, and v_i ∈ V_{t_i} is the target, receiver, or addressee of the event, taken from a set V_{t_i} of possible target nodes at the event time. For each time point t, the risk set R_t ⊆ U_t × V_t is the set of dyads on which an event could potentially happen at t. In particular, it holds that (u_i, v_i) ∈ R_{t_i}. (Note that Butts (2008) denoted by "support set" what we call "risk set"; we adopt the latter term, using the symbol R, since it is commonly applied in survival analysis.) For a time point t, the network of past events G[E; t] is a function of the events that happened strictly before t (and potentially of exogenous covariates). The network of past events shapes the distribution of events at t, as we explain in greater detail, and in the specific empirical setting of this paper, in Sect. 3.3.
For a time point t and a dyad (u, v) ∈ R_t, let T denote the random variable giving the time of the next event on (u, v), that is, of the first event e' = (u', v', t') with u' = u, v' = v, and t' ≥ t. The hazard rate (also intensity or event rate) on (u, v) at t is defined by

λ(u, v, t) = lim_{Δt → 0} P(t ≤ T < t + Δt | T ≥ t) / Δt.
Under rather weak assumptions, the hazard rate λ(u, v, t) can be interpreted as the expected number of events in the dyad (u, v) in a time interval of length one (Lawless 2003). In empirical studies one is typically interested in factors that tend to increase or decrease the dyadic event rate.
In the Cox proportional hazard model (Cox 1972), corresponding to the REM with ordinal times in Butts (2008), the hazard is specified, conditional on the network of past events G[E; t] and dependent on parameters θ ∈ R^k, as

λ(u, v, t, G[E; t]; θ) = λ_0(t) · λ_1(u, v, t, G[E; t]; θ), where (1)

λ_1(u, v, t, G[E; t]; θ) = exp(θ^T · s(u, v, G[E; t])). (2)

In the equations above, λ_0 is a time-varying baseline hazard function, identical for all dyads in the risk set, and the relative event rate λ_1(u, v, t, G[E; t]; θ), a function of the parameters θ and the sufficient statistics s(u, v, G[E; t]), is proportional to the probability that an event at t happens on (u, v), rather than on any other dyad in the risk set.
In the Cox proportional hazard model the partial likelihood based on the observed event sequence E is

L(θ) = ∏_{i=1}^{N} λ_1(u_i, v_i, t_i, G[E; t_i]; θ) / Σ_{(u,v) ∈ R_{t_i}} λ_1(u, v, t_i, G[E; t_i]; θ) (3)

and parameters θ are estimated to maximize L.
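With the exponential specification of λ_1 from Eq. (2), the negative log partial likelihood can be sketched in a few lines of Python. The helper below is our own illustration, not part of the software referenced in the paper; it assumes the statistic vectors for each event's risk set have already been computed, with the event dyad listed first.

```python
import math

def neg_log_partial_likelihood(theta, stats_per_event):
    """Negative log partial likelihood, cf. Eq. (3).

    stats_per_event[i] is the list of statistic vectors s(u, v, G[E; t_i])
    for all dyads in the risk set at time t_i; by convention the vector
    of the observed event dyad (u_i, v_i) comes first.
    """
    nll = 0.0
    for stats in stats_per_event:
        # linear predictors theta^T s for every dyad in the risk set
        scores = [sum(th * s for th, s in zip(theta, vec)) for vec in stats]
        # log of the denominator, via log-sum-exp for numerical stability
        m = max(scores)
        log_denom = m + math.log(sum(math.exp(x - m) for x in scores))
        nll -= scores[0] - log_denom
    return nll
```

Minimizing this function over θ (e.g., with a standard Newton-type optimizer) yields the maximum partial likelihood estimate.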

Case-control sampling
The problem in maximizing the likelihood is the runtime for evaluating the denominator in each of the factors of Eq. (3). In our empirical case, the risk set at the end of the observation period has more than 30 trillion elements. Thus, we cannot even evaluate the term for the single last event, much less the terms for all 360 million events. A possible solution has been suggested by Vu et al. (2015), who applied case-control sampling. In case-control sampling one analyzes all "cases" (dyads experiencing an event at the given point in time) but only a small random subset of the "controls" (dyads in the risk set not experiencing an event at the given point in time). (In this paper we will also speak of "events" instead of "cases" and of "non-events" instead of "controls".) Let e_i = (u_i, v_i, t_i) be the i'th event in E and let m be a positive integer, giving the number of controls per event. In the sampled likelihood the term R_{t_i} in the denominator of Eq. (3) is replaced by SR_i(m) ⊆ R_{t_i}, drawn uniformly at random from the set of all subsets of R_{t_i} that contain (u_i, v_i) and have m + 1 elements.

[Fig. 1: Sampling of observed events and of risk sets at sampled event times.] Observed events (left-most column) are sampled independently with probability p; sampled observed events are indicated by black squares. For each sampled event e_i, we sample m dyads (controls) from the risk set R_{t_i}; sampled controls are indicated by grey squares. Note that all observed events happening before time t (not just the sampled events) are used to maintain the network of past events G[E; t] and, thus, to compute the sufficient statistics s in the specification of the event rate λ_1 in Eq. (2).
From another point of view, the set SR_i(m) contains the dyad (u_i, v_i), that is, the one in which event e_i happens, and m additional elements (the "controls"), different from (u_i, v_i), drawn from R_{t_i} uniformly at random without replacement. For a given sequence of sampled case-control sets SR(m) = (SR_1(m), ..., SR_N(m)), the sampled likelihood is defined by

L_{[SR(m)]}(θ) = ∏_{i=1}^{N} λ_1(u_i, v_i, t_i, G[E; t_i]; θ) / Σ_{(u,v) ∈ SR_i(m)} λ_1(u, v, t_i, G[E; t_i]; θ). (4)

The procedure to sample from the risk set described above differs in several aspects from Markov chain Monte Carlo (MCMC) sampling as it is typically employed in the estimation of exponential random graph models (ERGM) (Hunter et al. 2008). One of the most crucial differences is that the risk set of a relational event model grows linearly in the number of dyads (that is, quadratically in the number of nodes), while the sample space of an ERGM grows exponentially in the number of dyads.
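Drawing one case-control set SR_i(m) can be sketched as follows. The sketch assumes the risk set can be sampled by drawing a user and an article independently, which is exact for the full Cartesian product risk set R_t = U_t × A_t used later in this paper; function and variable names are our own.

```python
import random

def sample_case_control_set(event_dyad, users, articles, m, rng=random):
    """Return SR_i(m): the event dyad plus m distinct controls drawn
    uniformly without replacement from the risk set U_t x A_t."""
    sampled = [event_dyad]
    seen = {event_dyad}
    while len(sampled) < m + 1:
        dyad = (rng.choice(users), rng.choice(articles))
        if dyad not in seen:  # rejection step: no duplicate dyads
            seen.add(dyad)
            sampled.append(dyad)
    return sampled
```

Since the risk set is many orders of magnitude larger than m, the rejection step almost never fires in practice.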

Sampling observed events
Another sampling strategy that we consider in this paper, in addition to case-control sampling, involves uniform sampling from the set of observed events. Concretely, for a real number p with 0 < p ≤ 1, let SE(p) ⊆ E be a random subset of observed events, obtained by including each e ∈ E independently at random with probability p. Additionally, let SR(m) be a sequence of sampled case-control sets as described above. The sampled likelihood is given by

L_{[SR(m), SE(p)]}(θ) = ∏_{i : e_i ∈ SE(p)} λ_1(u_i, v_i, t_i, G[E; t_i]; θ) / Σ_{(u,v) ∈ SR_i(m)} λ_1(u, v, t_i, G[E; t_i]; θ) (5)

and parameters θ are estimated to maximize the sampled likelihood L_{[SR(m), SE(p)]}(θ). Both sampling schemes, case-control sampling and uniform sampling from the observed events, are illustrated in Fig. 1. We emphasize that the networks of past events G[E; t_i] in Eq. (5) are functions of all events in E that happen before t_i, not just of the sampled events in SE(p). Thus, for a given dyad (u, v) the explanatory variables (sufficient statistics) s(u, v, G[E; t]) from Eq. (2) are computed correctly. This points to a crucial difference from the approach that analyzes sub-networks, as discussed in the introduction. Moreover, it points to a crucial difference from situations where events are missing (rather than left out due to sampling); obviously, missing events cannot be used to compute statistics.
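Uniform sampling from the observed events is straightforward to sketch; the helper below (our own illustration) includes each event independently with probability p.

```python
import random

def sample_events(events, p, rng=random):
    """Return SE(p): each observed event is kept independently with
    probability p; the order of retained events is preserved."""
    return [e for e in events if rng.random() < p]
```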
Using the sampled likelihood for estimation of model parameters introduces another source of uncertainty: if we repeated the sampling of events and controls, we would obtain a different sampled likelihood and thus, probably, also different parameter estimates. In our experiments, which are explained in detail in Sect. 3.5, we want to determine how the variation in the model parameters θ introduced by sampling depends on the sampling parameters p (the probability to sample an observed event) and m (the number of sampled controls per sampled event). In this paper we call p and m the sample parameters and θ the model parameters, unless it is clear from context which set of parameters is meant.

Asymptotic properties
An attractive property of case-control sampling in the Cox proportional hazard model is that it yields an estimator that is consistent and asymptotically normal. This is a non-trivial result since case-control sampling is a strategy that deliberately manipulates the ratio of events to non-events. These asymptotic properties are reassuring, but they do not diminish the relevance of the experimental approach of this paper.
In our experiments we assess the variation in the parameters caused by sampling for different sample sizes. The experimental results we present suggest that for some of the most common network effects typically included in empirical model specifications (Lomi et al. 2014), variation caused by sampling is much larger than the estimated standard errors of model parameters, even for relatively large sample sizes. It seems difficult to determine the sufficient sample size by purely theoretical considerations based on asymptotic properties -which is why we consider experimental evaluation as necessary.

Empirical setting: Network analysis of collaboration in Wikipedia
Wikipedia is a prime example of a self-organizing online production community in which voluntary users invest personal time and effort to provide a public good (Lerner and Tirole 2001;Lerner and Lomi 2017). The most crucial resources that articles need to garner in order to attain a high level of quality are the attention and contribution of users (Lerner and Lomi 2018b). This observation motivates a study of the factors that induce particular users to contribute more or less to particular articles. The network of relational events that we consider in this paper includes all events in which any registered user contributes to any article in the English-language edition of Wikipedia. The estimated model reveals basic network effects that explain dyadic contribution rates on all user-article pairs.
Previous work analyzing user contributions in Wikipedia mostly aggregates edits over users, articles, or time. For instance, Moat et al. (2013) analyze patterns in the dynamics of edits received by articles but do not model dyadic event rates, nor do they analyze network effects explaining edit frequencies. Yasseri et al. (2012) analyze periodic patterns in the global edit activity in different language editions of Wikipedia. Keegan et al. (2012) fit exponential random graph models to the two-mode network connecting Wikipedia users to "breaking-news articles". This model framework, however, requires aggregating relational events per dyad over time.
Surprisingly, considering that Wikipedia contribution data are naturally provided as time-stamped relational events, studies analyzing Wikipedia collaboration networks as relational event systems are still relatively infrequent. Perhaps this is due to the lack of methods that are able to estimate REM for large event networks. Lerner and Lomi (2017, 2018a, 2019b) propose and estimate REMs for the edit networks of single articles. Their model does not explain user activity rates but rather specifies dyadic probabilities to undo or redo contributions of other target users, given that a source user becomes active on the focal article. Previous work that seems to be the closest to this paper is a recent article of Lerner and Lomi (2019a) which analyzes a REM for edit and talk events on Wikipedia articles about migration-related topics (also discussed in the introduction). Thus, Lerner and Lomi (2019a) analyze a thematically delineated sub-network. They apply case-control sampling but do not sample from the observed events and do not experimentally vary sample sizes. Data on the Wikipedia editing network provide a particularly valuable setting to perform the experiments that we propose in this paper because of their size, accuracy, and complexity, and because of the substantive interest in Wikipedia as one of the largest and most successful examples of peer-production projects (Jan Piskorski and Gorbatâi 2017; Greenstein and Zhu 2012).

Data
The event network that we consider in this paper is given by the sequence of relational events in which any registered and logged-in Wikipedia user uploads a new version of any article in the English-language edition of Wikipedia in the time period from January 15th, 2001 (the launch of Wikipedia) to January 1st, 2018 (the time of data collection). For a point in time t we define U_t to be the set of all users who edited at least one article at or before t, and we define A_t to be the set of all articles that received at least one edit at or before t. The risk set at time t is defined to be the full Cartesian product of users and articles, R_t = U_t × A_t. It is possible to define the risk set in more sophisticated ways: users or articles could enter or leave the risk set as a function of events, due to extended periods of inactivity, or due to exogenously given entry or exit times. Such more complex risk set dynamics, however, are not considered in this paper. (We performed robustness checks by letting users and articles drop out of the risk set due to extended periods of inactivity; results were not affected in any meaningful way.)

Wikipedia makes its complete database available for public use.¹ All edit events can conveniently be extracted from the files enwiki-<date>-stub-meta-history.xml.gz, which contain meta-information (including user, article, and time) of every edit but not the text of the different versions (which we do not need for the analysis in this paper).
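Extracting the (user, article, time) triples from such a dump file can be sketched as follows. The element names (page/title, revision, contributor/username, timestamp) follow the MediaWiki export schema; treat them as an assumption to be checked against the actual dump version, and note that the sketch is simplified (e.g., it does not distinguish redirects or non-article namespaces).

```python
import gzip
import xml.etree.ElementTree as ET

def iter_edit_events(path):
    """Yield (user, article, timestamp) triples from a gzipped
    stub-meta-history dump. Anonymous (IP-only) edits are skipped.
    Requires Python 3.8+ for the {*} namespace wildcard."""
    with gzip.open(path, "rb") as f:
        title = None
        for _, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace
            if tag == "title":
                title = elem.text
            elif tag == "revision":
                user = elem.findtext(".//{*}username")
                ts = elem.findtext("{*}timestamp")
                if user is not None:
                    yield (user, title, ts)
                elem.clear()  # keep memory bounded on huge dumps
```

The incremental `iterparse` loop avoids loading the multi-gigabyte XML tree into memory.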
The time of edits in the Wikipedia data is accurate to the second. The Cox proportional hazard model does not make use of the precise timing of events: the given timestamps are used to determine the decay of past events, as explained in Sect. 3.3, and the order of events, but have no further influence on the model. In the given data several events can occur within the same second. We nevertheless imposed a strict order on all events: two non-simultaneous events are ordered by their timestamps, and if two events happen simultaneously, the order in which they appear in the input file is taken as the given order. We believe that the decision to enforce a strict order has no significant impact on the results (indeed, it is very unlikely that two simultaneous events end up in the same sample of events).
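The tie-breaking rule described above amounts to a stable sort by timestamp, sketched below (the helper name is ours).

```python
def impose_strict_order(events):
    """Order events by timestamp; Python's sort is stable, so events
    with identical timestamps keep their order in the input file."""
    return sorted(events, key=lambda e: e[2])  # e = (user, article, time)
```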
The data consist of 361,769,741 dyadic events of the form e = (u, a, t), indicating that user u ∈ U_t edits article a ∈ A_t at time t. At the end of the observation period, U_t contains 6,701,379 users and A_t contains 5,542,465 articles. Thus, at the end of the observation period the risk set contains 6,701,379 · 5,542,465 ≈ 3.714 · 10^13 (approximately 37 trillion) user-article pairs. For reproducibility and for facilitating similar research, the preprocessed data is available at https://doi.org/10.5281/zenodo.1626323.

Model
We specify the relative event rates λ_1(u, a, t, G[E; t]; θ) on all user-article pairs (u, a) ∈ R_t at time t as a function of past events that happened strictly before t, see Eq. (2). In doing so, we give less weight to events that happened further in the past than to more recent events.
More precisely, following the model proposed in Brandes et al. (2009), we define the network of past events G[E; t] at time t as the weighted two-mode network with node sets U_t and A_t, where the weight w(u, a, t) on any (u, a) ∈ U_t × A_t is a function of the past events happening in (u, a) before t, given by the formula

w(u, a, t) = Σ_{e = (u, a, t') ∈ E; t' < t} 2^{-(t - t') / T_{1/2}} (6)

for a given halflife T_{1/2} > 0. We set T_{1/2} to 30 days so that an event counts as (close to) one in the very next instant of time, it counts as 1/2 one month later, as 1/4 two months after the event, and so on. To reduce the memory needed to store the network of past events, we set a dyadic weight to zero if its value drops below 0.01. For a dyad in which a single event occurred, this happens after 6.64 · T_{1/2}, that is, after more than half a year. The weight w(u, a, t) can be interpreted as the recent event intensity in the dyad (u, a) before t. Edges in the network of past events are implicitly defined to be all dyads with non-zero weights. As a robustness check we repeated some of our analyses in a model without any decay (not reported in this paper); in general, findings did not change in any meaningful way.
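The weight function of Eq. (6), including the 0.01 truncation, can be sketched directly; this is an illustration, not the paper's implementation, and times are measured in days.

```python
def dyad_weight(event_times, t, half_life=30.0, cutoff=0.01):
    """Recent event intensity w(u, a, t): every past event at time tp < t
    contributes 2 ** (-(t - tp) / half_life); weights below the cutoff
    are truncated to zero to save memory."""
    w = sum(2.0 ** (-(t - tp) / half_life) for tp in event_times if tp < t)
    return w if w >= cutoff else 0.0
```

A single event's weight falls below 0.01 after log2(100) ≈ 6.64 halflives, matching the half-year figure in the text.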
The relative event rate λ_1, see Eq. (2), is specified using the following five sufficient statistics.
The statistic repetition(u, a, G[E; t]) characterizes the recent event intensity in the same dyad (u, a). A positive parameter associated with this statistic would reveal that users are more likely to edit articles they have recently edited before. The popularity statistic aggregates the recent event intensities in all dyads with target a, that is, it gives the recent intensity by which the article a is edited. A positive parameter associated with the popularity statistic would reveal that popular articles (those that recently received many edits by any user) are likely to receive edits (by potentially different users) at a higher rate in the future. The activity statistic aggregates the recent event intensities in all dyads with source u, that is, it gives the recent cumulative edit activity of user u. A positive parameter associated with the activity statistic would reveal that users who have recently been more active will also edit (potentially different) articles at a higher rate in the future. The four-cycle effect models local clustering. If users u and u′ recently collaborated in writing an article a′ and, in addition, user u′ recently contributed to article a, then a positive parameter associated with four.cycle(u, a, G[E; t]) would reveal that u has a tendency to edit a at a higher rate. The assortativity statistic models the interaction between user activity and article popularity. A positive parameter would reveal that the more active users are rather drawn to editing the more popular articles, while a negative assortativity parameter would reveal that the more active users are rather drawn to editing the less popular articles, where these relations have to be understood relative to the effects of the basic variables activity and popularity.
These five effects are very common in relational event modeling and are not particular to the specific setting that we examine, except that when analyzing a one-mode network, instead of a two-mode network, we would typically include a three-cycle effect, perhaps instead of the four-cycle effect. These effects are interesting for our current study since they assume dependence of dyadic observations on previous observations in varying levels of complexity. Repetition assumes that the rate of future events in (u, a) depends just on previous events in the same dyad. The popularity and activity effects assume that the rate of future events in (u, a) depends on previous events that have the same target article a (for popularity) or the same source user u (for activity). The four-cycle effect assumes that the rate of future events in (u, a) depends on previous events in a three-path of the form (u, a′), (a′, u′), (u′, a) for any user-article pair (u′, a′), different from u and a, respectively, provided that there was at least one recent event in all three dyads (u, a′), (u′, a′), and (u′, a). Assortativity is an interaction effect modeling that the activity of a user u has a different impact for editing popular articles than for editing less popular articles, and vice versa. Considering these five effects in our study is interesting since it is plausible that the different effects need different sample sizes to be reliably estimated. Due to skewed distributions we apply the mapping x → log(1 + x) to the four statistics repetition, popularity, activity, and four.cycle. The interaction effect assortativity is defined to be the product of popularity and activity after applying this logarithmic transformation. Table 1 provides information on correlation, variance, and covariance of the five statistics computed on the aggregated observations from 100 samples with sample parameters p = 10^-4 and m = 5 (see Sects. 3.5 and 4.2).
All correlations are positive and moderately high.
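A naive computation of the five statistics for one dyad can be sketched as follows. How the three dyadic weights of a three-path are combined into the four-cycle statistic is an assumption of ours (here the minimum); the log(1 + x) transformation and the definition of assortativity as a product follow the text.

```python
import math

def dyad_statistics(u, a, w):
    """Five statistics of dyad (u, a) given dyadic weights
    w[(user, article)] of the network of past events (naive sketch)."""
    repetition = w.get((u, a), 0.0)
    popularity = sum(wt for (u2, a2), wt in w.items() if a2 == a)
    activity = sum(wt for (u2, a2), wt in w.items() if u2 == u)
    four_cycle = 0.0
    for (u2, a2), wt2 in w.items():
        # three-paths (u, a2), (u2, a2), (u2, a) with u2 != u and a2 != a;
        # combining the three weights via min() is our own assumption
        if u2 != u and a2 != a and (u, a2) in w and (u2, a) in w:
            four_cycle += min(w[(u, a2)], wt2, w[(u2, a)])
    rep, pop, act, fc = (math.log(1 + x) for x in
                         (repetition, popularity, activity, four_cycle))
    return {"repetition": rep, "popularity": pop, "activity": act,
            "four.cycle": fc, "assortativity": pop * act}
```

This version iterates over all edges and is only meant to make the definitions concrete; the next section discusses how the statistics are computed efficiently.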

Runtime complexity
The total time for computing all explanatory variables is the computation time for maintaining the network of past events plus the time needed to compute the statistics for all sampled observations, that is, for sampled events and sampled controls. Below we briefly argue that maintaining the network of past events has a much smaller asymptotic runtime per event than computing the statistics for one dyad. This is an important property of our approach since we use all events to maintain the network of past events but compute statistics only on a sample of observations.
The network of past events can be updated in constant time per event, so that the total cost of maintaining the network of past events is proportional to the total number of events |E|. This might seem surprising since the weights on all edges change due to decay whenever time increases, see Eq. (6), which actually happens after most input events. We achieve the linear runtime by a "lazy" strategy: we do not apply the time-decay when time increases; instead we store for each edge, in addition to its weight, the time of the last weight update. The weight on a dyad (u, a) is only updated if either an event happens in (u, a) or if the value w(u, a, t) has to be accessed to compute statistics for (u, a) or surrounding dyads.
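The lazy strategy can be sketched as a small class (our illustration): weights are stored together with their last update time and decayed only when an edge is actually touched.

```python
class PastEventNetwork:
    """Network of past events with lazily evaluated exponential decay."""

    def __init__(self, half_life):
        self.half_life = half_life
        self.edges = {}  # dyad -> (weight at last update, last update time)

    def weight(self, dyad, t):
        """Current weight w(u, a, t); decay is applied on demand."""
        if dyad not in self.edges:
            return 0.0
        w, last = self.edges[dyad]
        return w * 2.0 ** (-(t - last) / self.half_life)

    def add_event(self, dyad, t):
        """Constant-time update: decay the stored weight to t, add one."""
        self.edges[dyad] = (self.weight(dyad, t) + 1.0, t)
```

Only the touched edge is ever recomputed, so processing an event stream costs O(1) per event regardless of how many other edge weights are nominally decaying.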
The cost for computing the statistics varies. The values of repetition, popularity, activity, and assortativity for a given dyad at a given point in time can be computed in constant time. In order to achieve this for the degree statistics (popularity and activity) we maintain weighted in-degrees of articles and weighted out-degrees of users explicitly as functions defined on nodes in the network of past events. (The cost for maintaining these node weights is also linear in the number of events.) The time for computing the four-cycle statistic of a dyad (u, a) is bounded by the product of the unweighted degrees (that is, the numbers of incident edges) of u and a. If the unweighted degrees of all nodes were equal to d, then the runtime for computing the four-cycle statistic for one dyad would be bounded by d².
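With unweighted adjacency lists at hand, the four-cycle computation with the stated degree-product bound looks roughly as follows; again, combining the three weights via the minimum is our own assumption, and all names are illustrative.

```python
def four_cycle_stat(u, a, articles_of, users_of, w):
    """Four-cycle statistic of (u, a). The nested loop visits at most
    deg(u) * deg(a) pairs, matching the bound discussed in the text.
    articles_of[u] / users_of[a] are the unweighted adjacency lists."""
    total = 0.0
    for a2 in articles_of.get(u, ()):
        if a2 == a:
            continue
        for u2 in users_of.get(a, ()):
            if u2 != u and (u2, a2) in w:
                total += min(w[(u, a2)], w[(u2, a2)], w[(u2, a)])
    return total
```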
In summary, it is much more costly to compute statistics for one observation than to update the network of past events as a result of one event. This is the rationale for our approach to use all events for maintaining the network of past events but only a fraction of events to compute the likelihood defined in Eq. (5). As discussed before, taking all events to update the network of past events ensures that the computed statistics for each sampled observation are correct.

Experimental design
The general goal of the computational experiments we report in the next section is to determine how the distributions of estimated model parameters θ, and their associated standard errors, z-values, and significance levels, depend on the sample parameters p and m. We first conducted a preliminary exploratory analysis to identify sample parameters p_0 and m_0 that are large enough to reliably assess the effects of all five statistics. It turned out that the different statistics require largely different numbers of observations to be reliably estimated (discussed in more detail in Sect. 4). We have chosen p_0 and m_0 such that at least the direction of all five effects, that is, whether they tend to increase or decrease the event rate, can be assessed with high confidence. After the preliminary analysis we set these initial sample parameters to p_0 = 0.0001 = 10^-4 and m_0 = 5, resulting in approximately 36,000 sampled events and about 5 · 36,000 = 180,000 sampled controls, that is, about 216,000 sampled observations. Then we performed four types of experiments to assess how the distributions of estimated model parameters depend on the sample parameters.
• (fixed p and m) We repeat the sampling one hundred times with fixed sample parameters p = p_0 and m = m_0. Doing so allows us to assess, for each of the five effects, the variation of model parameters, standard errors, and z-values caused by sampling.
• (varying p for fixed m) For fixed m = m_0 we vary p as in (p_0/2^i)_{i=1,...,10}. That is, for i from one to ten we halve, starting from p_0, the probability to sample events in each step. This results in approximately 18,000 sampled events for i = 1 and only about 35 sampled events for i = 10. The number of sampled controls is five times the number of sampled events, so that the number of sampled observations ranges from 108,000 down to 211. For each i = 1, ..., 10 we repeat the sampling ten times to assess the distribution of estimated model parameters for the five effects. In this type of experiment we assess which sample sizes are sufficient to reliably estimate parameters for the various effects.
• (varying m for fixed p) For fixed p = p_0/10 = 10^-5 we vary m as in the sequence (2^i)_{i=0,...,7}. This results in one control per event for i = 0 and 128 controls per event for i = 7. The expected number of sampled events is fixed at about 3,600, so that the number of sampled observations ranges from 7,200 (for i = 0) to 464,400 (for i = 7). Thus, for i = 7 we sample more than twice as many observations as for the initial sample parameters p_0 and m_0, but the sampled observations contain much fewer events and a much higher ratio of controls per event. For each i = 0, ..., 7 we repeat the sampling ten times.
In this type of experiment we assess the benefit of using more controls per event for a given limited number of events.
• (varying p and m for a fixed budget of observations) Similar to the experiment type above, we vary m as in the sequence (2^i)_{i=1,...,8}. However, in this type of experiment we let p change as a function of m to fix the number of sampled observations to that determined by p_0 and m_0. More precisely, for i = 1, ..., 8 we set m = 2^i and p = 6 · 10^-4 / (2^i + 1). The expected number of sampled observations is, thus, fixed at about 216,000. The number of controls per event ranges from 2 to 256 and the expected number of sampled events ranges from about 72,000 (for i = 1) to about 840 (for i = 8). For each i = 1, ..., 8 we repeat the sampling ten times. In this type of experiment we assume that we have a given budget of observations (for instance, constrained by runtime considerations) and assess whether it is preferable to include more events (and fewer controls per event), or the other way round, or whether there is an optimal intermediate number of controls per event.
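The arithmetic of the fixed-budget design above can be checked with a few lines; the total number of events (361,769,741) is taken from the data section.

```python
def fixed_budget_design(n_events=361_769_741):
    """For each i, use m = 2**i controls per event and p = 6e-4 / (2**i + 1),
    so the expected number of observations p * n_events * (m + 1) stays
    constant at roughly 216,000 regardless of i."""
    rows = []
    for i in range(1, 9):
        m = 2 ** i
        p = 6e-4 / (m + 1)
        rows.append((m, p, p * n_events))  # (controls, prob., expected events)
    return rows
```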
After conducting these experiments and analyzing the results we perform additional analyses to identify reasons why the estimation of the model parameter for the repetition effect needs so many more observations than for the other effects.

Results
In Sect. 4.1 we present and discuss estimated parameters on a single sample obtained with p = p0 and m = m0. Sections 4.2 to 4.5 present and discuss results of the four types of experiments described above. Section 4.6 conducts additional analyses to clarify findings about the repetition effect.

Model estimation on a single sample
Table 2 reports estimated model parameters and standard errors of a single sample obtained with the initial sample parameters p0 = 10^-4 and m0 = 5, resulting in about 36,000 events and 216,000 observations. All parameters are significantly different from zero at the 0.1% level, point in the expected direction, and are consistent with results reported in previous studies based on a much smaller sub-network (Lerner and Lomi 2019a). The parameter associated with repetition is positive, indicating that a user u is typically more likely to edit an article if she has recently edited it before. We find a positive popularity effect: articles that recently received more edits from any user are likely to be edited at a higher rate by potentially different users. Considering that contributions by users are a crucial resource for articles (Lerner and Lomi 2018b), this "rich-get-richer" effect (Barabási and Albert 1999) might actually have adverse implications for the development and quality of Wikipedia. Indeed, if contributing users tend to crowd together on a few popular articles, other articles are in danger of being neglected. We also find a positive activity effect: users that recently contributed more frequently to any article are likely to edit potentially different articles at a higher rate. The positive parameter of the four-cycle effect points to local clustering, where the clusters might be interpreted as latent topics of articles, or interests of users, respectively. If the interpretation via user interests and article topics applies, the four-cycle effect can be understood as follows. If two users u and u′ recently co-edited an article a′, then they are likely to have similar interests. If, additionally, u′ recently edited another article a, then a and a′ are likely to involve similar topics. In turn, user u is also likely to edit article a at a higher rate. Assortativity, that is, the interaction effect between the activity of users and the popularity of articles, is found to be negative. Thus, while users in general have a tendency to edit already popular articles, this behavior is less pronounced for the more active users: the more active users seem to dedicate a bigger share of their work (than the less active users) to the less popular articles. In the light of the discussion about the popularity effect above, this negative assortativity might be very important for Wikipedia, since it indicates that the more active users might be the ones who start editing less popular articles, which would otherwise be in danger of being neglected.
Having established the consistency of the results with prior studies (Lerner and Lomi 2019a), and the contextual interpretability of the empirical estimates, we focus in the next sections on the variation in the estimates associated with different sample parameters.

Variation caused by sampling (fixed sample parameters)
Table 3 reports summary statistics of estimated parameters, standard errors, and z-values for the five effects over 100 samples drawn with p = p0 and m = m0. The boxplots in Fig. 2 display the summary statistics for the estimated parameters.
Figure 2: Boxplots (different y-axes) of parameter estimates for the five effects over 100 samples with p = 0.0001 and m = 5.
It is encouraging that all estimations over 100 samples confirm the qualitative findings from Table 2. The parameters for the repetition, popularity, activity, and four-cycle effects are positive in all 100 samples, and the parameters for the assortativity effect are negative in all samples. These findings are also highly significant in all samples. The minimum absolute z-value (that is, the parameter divided by its standard error) over all 100 samples and all five effects is 3.44 (for repetition), still pointing to an effect that is significant at the 0.1% level. Thus, the conclusions about the effects in the Wikipedia collaboration network could have been obtained by analyzing any of the 100 samples and coincide with the effects discussed in Sect. 4.1. Differences in the reliability of estimates for the different effects, however, become apparent once we focus on the size of parameters. The variability of parameters across samples, shown in Table 3 (also compare Fig. 2), differs between effects. The parameters for the popularity and activity effects, and to a lesser extent for four-cycle and assortativity, show relatively small variation, as do their associated standard errors and z-values. For instance, for the popularity effect the interquartile range of parameters is [1.27, 1.32] and the minimum and maximum parameters over 100 samples span the interval [1.20, 1.39]. Thus, had we analyzed just one of the samples, then with 50% probability we would have obtained a parameter value from the interval [1.27, 1.32], with 25% probability a value from [1.20, 1.27], and with 25% probability a value from [1.32, 1.39]. Conclusions drawn from any of these results would hardly vary in a remarkable way.
Parameters for the activity effect show about the same variability, and the variability of the parameters for the assortativity and four-cycle effects is somewhat higher but would still not affect conclusions in any meaningful way. By far the highest variation is found for the repetition effect, where parameters range from 9.38 to 86.73 with an interquartile range of [13.40, 66.38]. We also find that the mean of the repetition parameter is considerably higher than the median, pointing to a right-skewed distribution (also compare Fig. 2). Standard errors and z-values for the repetition effect also show high variation. Thus, while the qualitative finding of a significantly positive repetition effect does not depend on the specific sample, the magnitude of the parameter varies strongly over samples, at least for the given sample parameters p0 and m0. In Sect. 4.6 we examine why exactly the repetition effect behaves so differently.
A further insight that we can draw from Table 3 is that for those effects that have a low variation over samples (that is, popularity, activity, assortativity, and to a lesser extent the four-cycle effect) the standard deviation of the model parameters over samples and the mean standard errors of the model parameters are quite similar. For these four effects, the variation of standard errors is also rather small. Together, these findings suggest that, if variation over samples is small, then the standard deviation of parameters caused by sampling can be approximated by the typical standard error, a result that is consistent with asymptotic properties. This is a convenient result since standard errors can be estimated from a single sample. However, checking the relation between the standard deviation of parameters over samples and the mean standard error for the repetition effect, we find a much larger difference. The standard deviation of the repetition parameter over the 100 samples is 27.00, much higher than the mean standard error (6.42), or the median standard error (2.91), and even higher than the maximum standard error over all 100 samples (18.18). Thus, judging the variation in model parameters caused by sampling from the standard error of a single sample would be a poor approximation for the repetition effect.
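The approximate identity between the standard deviation of estimates across repeated samples and the typical standard error can be illustrated in a toy setting. The sketch below uses repeated OLS estimation of a regression slope as a stand-in for fitting a relational event model; all names and numbers are illustrative, not taken from the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_slope(x, y):
    """OLS slope and its standard error for a simple regression."""
    n = len(x)
    xc, yc = x - x.mean(), y - y.mean()
    beta = (xc @ yc) / (xc @ xc)
    resid = yc - beta * xc
    se = np.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
    return beta, se

# Repeatedly draw samples from the same population, refit, and
# compare the spread of estimates with the typical standard error.
betas, ses = [], []
for _ in range(200):
    x = rng.normal(size=500)
    y = 2.0 * x + rng.normal(size=500)
    b, s = fit_slope(x, y)
    betas.append(b)
    ses.append(s)

sd_across_samples = np.std(betas)
mean_se = np.mean(ses)
# with a well-behaved covariate the two quantities are close
assert abs(sd_across_samples - mean_se) / mean_se < 0.25
```

For a near-degenerate covariate (as with the repetition statistic among the controls) this closeness breaks down, which is exactly the failure mode discussed above.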

Varying the number of events
In a second family of experiments we fix the number of controls per event to m = 5 and let the probability p to sample events vary from about 10^-7 to 0.5·10^-4, doubling p in each step (compare Sect. 3.5). For each of these values of p we draw ten samples. Model parameters for the five effects estimated for the different sample parameter settings are displayed in Figure 3; summary statistics of estimated parameters, standard errors, and z-values for the five effects are given in the Appendix in Tables 5 to 9. Typically, the variability of parameters decreases when the number of sampled events increases, as expected. An exception to this rule is the repetition effect, which we discuss last.
Figure 3: Boxplots of parameter estimates over 10 samples with varying p (as given on the x-axis) and m = 5. We discard boxplots if their dispersion causes a scaling of the y-axis that would make the less dispersed boxplots unreadable. (All results are given in the Appendix in Tables 5 to 9.)
For the popularity effect (see Table 6) we find that only the two smallest sample sizes give unreliable (but also insignificant) results. We never estimated a parameter that is significantly negative at the 5% level (corresponding to a z-value that is at least 1.96 in absolute value). For values of p = 0.391·10^-6, which corresponds to as few as 140 events (and five times as many controls), or higher, we estimate only positive parameters, although not all estimates are significant. For values of p = 0.156·10^-5, which corresponds to 561 events, or higher, we estimate only parameters for the popularity effect that are significantly positive at the 5% level. The mean of estimated standard errors for p = 0.156·10^-5 is already approximately equal to the standard deviation of parameters over the different samples. For higher values of p they become even closer and both tend to zero. In summary, a few hundred events already seem to be sufficient to reliably estimate the parameter of the popularity effect. Findings for the activity effect (see Table 7) are similar to those for popularity. All estimated parameters for user activity are positive (though not always significant), even for the smallest sample comprising about 35 events. All parameters are significantly positive at the 5% level starting from p = 0.781·10^-6, which results in about 281 events and five times as many controls. The standard deviation of estimated parameters is close to the mean of the standard errors for all but the two smallest sample sizes and tends to zero as the sample size increases.
Reliable estimation of the parameter for the four-cycle effect (see Table 8) requires more observations than for the degree effects popularity and activity. For a sample probability ranging from p = 0.977·10^-7 to p = 0.156·10^-5 (corresponding to about 35 to 561 events) some of the estimated four-cycle parameters are negative (in contrast to the more reliable finding reported in Table 2), although the negative parameters are never significant at the 5% level. Starting from p = 0.313·10^-5 all estimates are positive, and for p = 0.25·10^-4 (corresponding to about 9,000 sampled events), or higher, all estimated four-cycle parameters are significantly positive.
Findings for the assortativity effect (see Table 9) are similar to those for the four-cycle effect (keeping in mind that there is a negative assortativity effect according to the findings reported in Table 2). For p up to 0.313·10^-5 we estimate some positive parameters for the assortativity effect, which, however, are never significant at the 5% level. For p = 0.5·10^-4 (corresponding to about 18,000 events) all estimated assortativity parameters are significantly negative. For both effects, four-cycle and assortativity, the approximate identity between the standard deviation of the parameters and the mean standard error holds once all parameters are significant, and these values tend to zero when the sample size increases.
Findings for the repetition effect (see Table 5) are different. For this effect we observe that the standard deviation of the estimated parameters (that is, the uncertainty caused by sampling) does not decrease when we increase the number of sampled events from 35 to 18,000. In contrast, the standard errors do tend to decrease with increasing sample size. Taken together, the approximate identity between the standard deviation of the parameters and the mean standard error does not hold for the repetition effect (at least not for this range of sample sizes). While these findings are somewhat disturbing, since they imply that considerable uncertainty in the parameter sizes is caused by sampling, findings on the direction of the repetition effect seem to be quite reliable. For all but the two smallest sample sizes we estimate only positive parameters (the negative estimates are never significant), and for p = 0.5·10^-4 all estimates are significantly positive. Thus, while there is considerable uncertainty about the size of the repetition effect, the danger of incorrectly rejecting the null hypothesis seems to be small. We further analyze why the repetition effect shows such outlying behavior in Sect. 4.6.

Varying the number of controls per event
In the next family of experiments we fix the probability to sample events to p = 10^-5, corresponding to about 3,600 sampled events, and let the number of controls per event m vary from 1 to 128, doubling m in each step (compare Sect. 3.5). Thus, for the smallest m = 1 we sample about 7,200 observations (events plus controls) and for the largest m = 128 we sample about 464,400 observations, more than twice as many as we analyzed in Table 2 or in each of the samples considered in Sect. 4.2, but with a much higher ratio of controls over events. For each of these values of m we draw ten samples. Model parameters for the five effects estimated for the different sample parameter settings are displayed in Figure 4; summary statistics of estimated parameters, standard errors, and z-values for the five effects are given in the Appendix in Tables 10 to 14. In general, the variability of parameters decreases for all but the repetition effect when the number of sampled controls increases for a fixed number of sampled events. Similar to the findings obtained by varying the number of sampled events, for the repetition effect (see Table 10) we observe that the variability caused by sampling (that is, the standard deviation of parameter estimates) does not decrease when increasing the number of observations. On the other hand, all parameter estimates are consistently positive and all estimates are significantly positive at the 5% level for m = 128. In contrast to the standard deviation of the parameters, the mean standard errors do decrease with a growing number of observations. Thus, for higher m more and more parameters are significantly different from zero. We observe, however, that the strategy of sampling more controls per event needs higher numbers of observations to achieve the same standard errors.
For instance, we need more than 464,000 observations, 3,600 of which are events, to bring the standard errors down to about 12.9, while 18,000 events (p = 0.5·10^-4) together with 90,000 controls yield a mean standard error of about 7.7 (see Table 5). This supports the rationale of case-control sampling that "the contribution of the nonfailures [controls], in terms of the statistical power of the study, will be negligible compared to that of the failures [events]". In Sect. 4.5 we analyze more systematically whether there is an optimal number of controls per event.
For the popularity effect (see Table 11) we find that not only the mean standard error but also the standard deviation of the parameters gets smaller with a growing number of observations for a fixed number of events. All parameter estimates for this effect are significantly positive for all values of m that we considered in this type of experiment. We find that increasing the number of controls for a fixed number of events seems to require more observations to bring down the standard errors, or the standard deviation of parameters, than sampling more events but fewer controls per event. For instance, the setting with sample parameters p = 10^-5 and m = 64 yields about 234,000 observations and achieves about the same standard deviation of the parameters as the setting with p = 0.5·10^-4 and m = 5, which yields about 108,000 observations. Findings for the activity effect (see Table 12) are very similar to those for the popularity effect.
Estimating the parameters of the four-cycle effect (see Table 13) requires more observations than for the degree effects. The setting with p = 10^-5 and m = 2 is actually the only one in which we estimate a parameter that is significant at the 5% level and that points in the opposite direction to the corresponding parameter given in Table 2. Considering that we draw and analyze 360 samples in total, incorrectly rejecting a null hypothesis once at the 5% level seems acceptable. We find in Table 13 that starting from m = 16 we only estimate significantly positive four-cycle parameters.
Estimated parameters for the assortativity effect (see Table 14) are in most cases negative (as in Table 2); the positive estimates are never significant, and starting from m = 64 we only estimate significantly negative parameters.

Varying the ratio of controls per event for a given budget of observations
In the final set of computational experiments we explore whether, given a limited budget of observations, it is preferable to draw fewer events and more controls per event, or whether more events and fewer controls yield more reliable estimates. Thus, we fix the number of observations to about 216,000, the number of observations resulting from the initial sample parameters p = 10^-4 and m = 5 considered in Sect. 4.2. We let the number of controls per event m vary from 2 to 256, doubling m in each step (compare Sect. 3.5). For a given m we set the probability to sample events to p = 6·10^-4/(m+1) to keep the number of sampled observations constant. Thus, for the smallest m = 2 we sample about 72,000 events and 144,000 controls, and for the largest m = 256 we sample about 840 events and about 215,000 controls. For each of these values of m and p we draw ten samples. Model parameters for the five effects estimated for the different sample parameter settings are displayed in Figure 5; summary statistics of estimated parameters, standard errors, and z-values for the five effects are given in the Appendix in Tables 15 to 19. In general, the variability of parameters increases when we sample additional controls at the expense of sampling fewer events, so that sampling more events and fewer controls seems to be the better investment.
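The relation between m, p, and the fixed observation budget described in this section can be checked with a few lines (the numbers are taken from the text; the variable names are our own):

```python
# Expected sample composition for a fixed budget of ~216,000 observations.
# N is the total number of observed events; p is chosen so that
# p * (m + 1) stays constant at 6e-4.
N = 360_000_000

for i in range(1, 9):
    m = 2 ** i                        # controls per event
    p = 6e-4 / (m + 1)                # event-sampling probability
    events = p * N                    # expected number of sampled events
    observations = events * (m + 1)   # events plus their controls
    print(f"m = {m:3d}: ~{events:,.0f} events, ~{observations:,.0f} observations")
```

The first line of output corresponds to about 72,000 events (m = 2) and the last to about 840 events (m = 256), while the expected total number of observations stays at 216,000 throughout.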
Findings for the repetition effect (see Table 15) fit the general pattern that the variability of estimates increases if the number of sampled events decreases and the number of controls increases accordingly. For this effect, a considerable jump becomes apparent as m increases from 2 to 4 to 8. For m = 2 the standard deviation of the parameters is 3.33 and the mean standard error is 0.97. These numbers increase to a standard deviation of 9.09 and a mean standard error of 1.87 for m = 4, and to a standard deviation of 26.06 and a mean standard error of 13.34 for m = 8. For higher values of m, the standard deviation is consistently above 20 and the mean standard error consistently above 10. Starting from m = 64, some of the estimated parameters are no longer significant at the 5% level, although all estimates are positive.
These findings might seem to suggest that increasing the ratio of events over controls, while keeping the number of observations constant, is a strategy for reliably estimating the repetition parameter. As a cautionary note, we point out already here that results from Sect. 4.6 suggest that this conclusion is misleading: the repetition variable is very skewed and almost all controls have the same value (equal to zero). We discuss estimation of the repetition effect in more detail in Sect. 4.6.
Parameter estimates for the popularity effect (see Table 16) are almost unaffected by the ratio of controls over events for values from m = 2 to m = 32. Standard errors and the standard deviation of the parameter estimates increase slightly for m = 64 and m = 128 and increase sharply for m = 256. For this last value we also get the first non-significant estimates, although all estimates are positive. Findings for the activity effect (see Table 17) and the four-cycle effect (see Table 18) are similar to those for the popularity effect, showing a sharp increase in variability for m = 256, although all estimated parameters for activity and four-cycle remain significant over the analyzed range of m.
For the assortativity effect (see Table 19) we make the same observation that variability increases sharply for m = 256. Some estimates for this parameter become insignificant for m = 32 or higher, although all estimates are negative.
In general, it seems to be a better investment to sample events rather than controls, so that the number of controls per event is best set to a small number. For values in the range of m = 2 to m = 16 the actual value seems to make little difference. The remarkable differences between m = 2, 4, and 8 for the repetition effect rather seem to be a spurious finding, as discussed in Sect. 4.6.

Examining the repetition effect
A common pattern in the results discussed above is that the estimated parameters of the repetition effect had a much higher variability across samples than those of the other effects. Disturbingly, and in contrast to findings for the other effects in the model, the variability of the repetition parameter often does not decrease when we increase the sample size, at least not for the ranges of sample parameters considered so far. We have also found, again in contrast to findings for the other effects, that the standard errors of the repetition effect are a poor approximation for the variability caused by sampling. Thus, while the direction of this effect is consistently positive, its magnitude can only be estimated with a relatively high degree of uncertainty.
In this section we perform an additional analysis to find explanations for why the estimated parameters of the repetition effect have such a high standard deviation compared to the other effects. We analyze the combined 100 samples obtained with p = 10^-4 and m = 5, that is, the aggregation of the samples used in Sect. 4.2. These constitute more than 21 million observations, of which 16.7% are events and 83.3% are controls. The densities, that is, the proportions of observations with non-zero values for the five statistics, are reported in Table 4. We compute and report densities separately for all observations, restricted to events, and restricted to controls (i.e., dyads not experiencing an event at the given time).
In short, the reason for the anomalous behavior of the repetition parameter seems to be the extremely low number of controls that have a non-zero value in the repetition statistic. The density of the repetition statistic restricted to controls is just about 6·10^-6. In the aggregation of all 100 samples, we have about 100 controls with non-zero repetition values. Thus, each of the samples analyzed in Sect. 4.2 has on average just one control with a non-zero repetition value. The high variability of estimated repetition parameters across samples seems to be caused by the randomness of sampling none, just one, or several controls with a non-zero value in the repetition statistic.
We can now also better understand the result reported in Sect. 4.5 (also see Table 15) that sampling fewer controls (while increasing the number of events accordingly) seems to bring down the variability of the repetition parameter to a relatively low value. Decreasing the number of sampled controls actually also decreases the probability that the sample includes an exceptional non-event dyad with non-zero repetition variable. For low numbers of controls it becomes unlikely to sample even a single one of these exceptional controls. In such situations, the parameter estimates become only seemingly reliable, since the sample variation of the explanatory variable among the controls suddenly drops to zero. Clearly this is misleading: if our sample accidentally included one of the exceptional controls with non-zero repetition variable, then the parameter estimate would change strongly.
The other four statistics are much more homogeneously distributed. The statistic with the next lowest density is the four-cycle statistic, which has a density of 0.04 among the controls. This translates to more than 760,000 controls with a non-zero four-cycle statistic in the aggregation of all 100 samples and, on average, more than 7,600 in each of the samples analyzed in Sect. 4.2. Apparently this number is sufficient to estimate parameters with low uncertainty caused by sampling. The other statistics, popularity, activity, and assortativity, are more evenly distributed than the four-cycle statistic. We discuss in Sect. 5.1 a possible way to reduce the variability of the repetition parameter.
We emphasize that it is important to check the density of non-zero values -or, more generally, the distributions of explanatory variables -separately for events and controls. Table 4 reveals that the density of the repetition statistic over all observations (events and controls) is not exceptional.
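A check of this kind is straightforward to implement. The sketch below computes the non-zero density of a statistic over all observations, over events only, and over controls only, on synthetic data mimicking a near-degenerate statistic; the function name and the toy data are our own.

```python
import numpy as np

def nonzero_density_by_group(stat, is_event):
    """Proportion of non-zero values of a statistic, reported
    for all observations, events only, and controls only."""
    stat = np.asarray(stat)
    is_event = np.asarray(is_event, dtype=bool)
    return {
        "all":      np.mean(stat != 0),
        "events":   np.mean(stat[is_event] != 0),
        "controls": np.mean(stat[~is_event] != 0),
    }

# Toy data: a statistic that is often non-zero among events but
# almost never non-zero among controls.
rng = np.random.default_rng(1)
is_event = np.r_[np.ones(1000, bool), np.zeros(5000, bool)]
repetition = np.where(is_event,
                      rng.random(6000) < 0.5,     # events: ~50% non-zero
                      rng.random(6000) < 0.001)   # controls: ~0.1% non-zero

d = nonzero_density_by_group(repetition, is_event)
# the overall density hides the near-degeneracy among the controls
assert d["controls"] < 0.01 < d["all"]
```

As in Table 4, only the per-group view exposes that the statistic is nearly constant among the controls, while the pooled density looks unremarkable.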

Summary of results and future work
In general, findings reported in this paper are encouraging. Reliably estimating relational event models on event networks with millions of nodes and hundreds of millions of events seems to be possible by sampling some tens of thousands of events and a small number of controls per event. For some of the analyzed effects -most notably the degree effects popularity and activity -an even smaller sample size comprising a few hundred events seems to be sufficient. In general, z-values (that is, parameters divided by their standard errors) seem to provide a reliable criterion for whether estimated parameters are significantly positive or negative. This suggests that one way to scale up REMs is by applying relatively simple and well-established sampling schemes.
Thus, the direction of effects can apparently be estimated reliably with relatively small sample sizes. A different task, however, is to assess the size of parameters with small variability. In this respect, the repetition effect turned out to be exceptionally difficult.
For those settings of the sample parameters that yield a low parameter variation caused by sampling, the standard deviation of parameters across samples is approximately equal to the mean standard error, as suggested by asymptotic results. This is a convenient result since standard errors can be estimated from a single sample. Our results related to the repetition parameter, however, indicate that this approximate identity cannot be taken for granted: the standard deviation of the repetition parameter is consistently much higher than its typical standard error. We identified the extreme sparseness of the repetition variable among the controls as a likely reason for this finding. Only six controls in a million have a non-zero value in the repetition variable. Reliable estimation seems to need more than just a few of these exceptional controls, implying that the sampled set of controls should have several million elements (see, however, Sect. 5.1 for a more efficient way to get reliable estimates).
Given these findings, analysts should not just estimate parameters and standard errors for a single sample and draw their conclusions from the resulting z-values. It is advisable to check the distribution of statistics separately for sampled events and for sampled controls. It seems to be problematic if any of these distributions reveals that almost all controls, or almost all events, have the same value. Just checking the distribution of statistics over all observations (events and controls) is insufficient to detect outlying statistics, as is the variance, covariance, or correlation of statistics computed over all observations (see Table 1). In addition to checking empirical distributions, one could also infer by theoretical arguments whether certain statistics are likely to be heavily skewed. In our concrete model we could conclude, by comparing the size of the risk set (more than 30 trillion) with the number of dyadic events (about 360 million), that the repetition variable cannot be evenly distributed over the risk set. (On the other hand, it seems to be much more difficult to conclude from theoretical considerations why the four-cycle statistic has a relatively high density, equal to 0.04, among the controls.) In situations in which almost all controls take a single value in some statistic, the most efficient remedy seems to be to change the sampling design as sketched in Sect. 5.1.
If computational resources are sufficient, analysts can draw several samples with the same sample parameters, fit models separately to these, and then assess the variation caused by sampling (as we do in this paper). If for computational reasons (or otherwise) repeated sampling is not possible, bootstrap sampling (Efron 1992) might be an alternative to assess variation of parameters from one given set of observations. (We experimentally applied bootstrap sampling to one single sample and found parameter variation estimated from bootstrapping to be very similar to the one obtained in Sect. 4.2.) However, in light of the findings discussed in Sects. 4.5 and 4.6, re-sampling (or bootstrap sampling) does not render it unnecessary to check distributions of statistics: if some statistic is extremely degenerate, even large combined sample sizes might miss the exceptional observations and would incorrectly suggest a small parameter variation.
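Bootstrap sampling from a single set of observations can be sketched as follows, again with an OLS slope standing in for a model parameter (a hypothetical toy example, not the estimation used in the paper):

```python
import numpy as np

rng = np.random.default_rng(7)

def fit_slope(x, y):
    """OLS slope as a stand-in for a fitted model parameter."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

# One given set of observations (toy data with true slope 1.5
# and unit-variance noise).
x = rng.normal(size=400)
y = 1.5 * x + rng.normal(size=400)

# Nonparametric bootstrap: resample observations with replacement,
# refit, and take the spread of refitted parameters as an estimate
# of the sampling variability.
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(x), size=len(x))
    boot.append(fit_slope(x[idx], y[idx]))

boot_sd = np.std(boot)
# compare against the analytic standard error (using the known
# noise standard deviation of 1.0 in this toy setup)
analytic_se = 1.0 / np.sqrt(len(x) * x.var())
assert abs(boot_sd - analytic_se) / analytic_se < 0.3
```

As noted above, such a bootstrap only works if the sample actually contains the rare, informative observations; for a near-degenerate statistic the resamples may all miss them and understate the true variability.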
Inspecting the values of statistics separately on sampled events and sampled controls might reveal that, for instance, all sampled controls take the same value, and thus could point to near-degenerate distributions.
We also analyzed the trade-off between sampling more events and fewer controls per event versus sampling fewer events and more controls per event, given a limited budget of observations. Results seem to confirm that events are more valuable than controls, so that m, the number of controls per event, is best kept at a small number, provided that the number of observed events to sample from is sufficiently high. In our experiments, the actual value of m seemed to make little difference for values up to m = 16, but choosing higher values of m, at the expense of sampling fewer events, increased parameter variation.
We also fitted model parameters after standardizing the explanatory variables (these estimates are not reported in this paper). That is, we transformed all the statistics by subtracting their mean and dividing by their standard deviation before fitting the model. In general, standardization resulted in lower variation of the estimated model parameters; this applies especially to the repetition effect, as variation for the other effects is small anyway. However, dividing statistics by their standard deviation has a side effect when we want to compare estimations on samples with varying numbers of controls per event. Typically, controls have much smaller values in the various statistics. Thus, if we include more controls per event we typically get lower standard deviations of statistics. Due to these data characteristics, standardization of statistics resulted in parameter estimates that systematically tend to zero with a higher ratio of controls per event (although standardization does not affect z-values, since standard errors are scaled with the same factor as parameters). In this paper, we leave it open whether standardization of statistics is recommendable in general. In our study we do not want to scale statistics with different factors in different samples; for this reason we report results obtained without standardization. Yet, in other studies standardization might be appropriate to improve comparability of parameter sizes over different effects.

Improving estimation of the repetition parameter: stratified sampling
Reliable estimation of the size of the repetition parameter turned out to be much more difficult than for the other effects. We found in Sect. 4.6 that the extreme sparsity of the repetition statistic among the controls is a likely cause. One obvious way to bring down the sampling variance of the repetition parameter is to further increase the sample size until the samples include a sufficiently high number of controls with non-zero values. However, this approach is inefficient. A more efficient approach seems to be stratified sampling, where we divide observations into strata (for instance, dyads with zero weights, compare Sect. 3.3, versus dyads with non-zero weights) and sample controls evenly from these strata.
When applying stratified sampling, we have to re-weight observations when estimating the Cox proportional hazard model. Stratification typically leads to estimates with lower variation; it has been applied to fitting REM in Vu et al. (2015). We do not consider stratified sampling in this paper, but suggest it as a strategy to estimate REM in situations with near-degenerate explanatory variables, that is, variables that are (almost) constant on the vast majority of instances but take considerably different values on a few instances.
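A sketch of how such stratified control sampling with inverse-probability re-weighting could look, assuming the risk-set dyads and their statistic values are available. The weight of each sampled control is its stratum size divided by the number of controls drawn from that stratum, so that weighted sums over the sample reproduce sums over the full risk set in expectation; all names are hypothetical.

```python
import random

def stratified_controls(risk_set, statistic, m, seed=0):
    """Sample m controls split evenly between dyads with zero and
    non-zero values of a sparse statistic, attaching inverse-probability
    weights for use in a weighted partial likelihood.

    A sketch assuming both strata are non-empty;
    weight = stratum size / number sampled from that stratum.
    """
    rng = random.Random(seed)
    zero = [d for d in risk_set if statistic[d] == 0]
    nonzero = [d for d in risk_set if statistic[d] != 0]
    out = []
    for stratum in (zero, nonzero):
        k = min(m // 2, len(stratum))
        if k == 0:
            continue
        w = len(stratum) / k
        for d in rng.sample(stratum, k):
            out.append((d, w))
    return out

risk_set = [(1, j) for j in range(2, 12)]                  # 10 null dyads
statistic = {d: (3 if d in {(1, 2), (1, 3)} else 0) for d in risk_set}
controls = stratified_controls(risk_set, statistic, m=4)
```

Summing the weights of the sampled controls within each stratum recovers the stratum sizes, which is what keeps the weighted estimation unbiased.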

Beyond the Cox proportional hazard model
In this paper we considered the Cox proportional hazard model, where the estimated parameters explain which statistics increase or decrease the rate of events but the baseline hazard λ0(t), see Eq. (1), is left unspecified. This model, which corresponds to the "ordinal model" of Butts (2008), is appropriate in many situations, since we are often less interested in the absolute rate of events than in the factors that increase or decrease the likelihood of events. In our setting, the baseline hazard λ0(t) corresponds to the rate of edit events at time t in the whole of Wikipedia, that is, independent of the active user or the target article. This global rate of events changes systematically over the lifetime of Wikipedia and also varies periodically by day of the week and time of day (Yasseri et al. 2012). Thus, it may even be preferable to filter out these regularities in the event rate, as the Cox proportional hazard model does, if the goal of a study is to analyze network effects explaining dyadic event rates.
Yet, in other applications of relational event models, analysts might be interested in the absolute event rate or, from another point of view, in the expected time to events, and not just in whether the rate on some dyads tends to be higher than on others. Relational event models that achieve this are well known (Butts 2008), but applying case-control sampling to REMs that take time information from a ratio scale, rather than an ordinal scale, raises some additional issues. It is easy to see that manipulating the number of controls per event, as case-control sampling does, affects at least the intercept of a parametric specification of the full hazard function. (In contrast, a Cox proportional hazard model has no intercept, since an intercept would be absorbed by the baseline hazard.) An alternative for analysts interested in the absolute hazard rate would be to estimate the integrated baseline hazard Λ0(t) = ∫_0^t λ0(t′) dt′, where λ0(t) is the baseline hazard from Eq. (1), in a non-parametric way, an approach that can be combined with case-control sampling. If, additionally to case-control sampling, we also sample from the observed events with probability p < 1, we just have to scale the obtained baseline hazard by 1/p.
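For instance, a Breslow-type estimator of the integrated baseline hazard can be corrected for both sampling schemes: the risk-set sum in the denominator is approximated from the m sampled controls, and the resulting cumulative hazard is scaled by 1/p. The following single-covariate sketch (all names hypothetical) illustrates the idea, not the exact estimator of any cited work:

```python
import math

def breslow_cumhaz(sampled_events, beta, risk_set_size, p):
    """Breslow-type estimate of the integrated baseline hazard
    Lambda_0(t) at each sampled event time, under case-control
    sampling (controls approximate the risk-set sum) and uniform
    sampling of observed events with probability p.

    sampled_events: list of (time, x_event, x_controls), where x_* are
    scalar covariate values; a sketch with a single covariate.
    """
    cumhaz, lam = [], 0.0
    for (t, x_event, x_controls) in sorted(sampled_events):
        # scale the m control contributions up to the full risk set
        m = len(x_controls)
        denom = math.exp(beta * x_event) + (risk_set_size - 1) / m * sum(
            math.exp(beta * x) for x in x_controls)
        lam += 1.0 / denom
        cumhaz.append((t, lam / p))  # correct for event-sampling prob. p
    return cumhaz

events_in = [(1.0, 1.0, [0.0, 0.0]), (2.0, 0.5, [0.0, 0.0])]
cumhaz = breslow_cumhaz(events_in, beta=0.0, risk_set_size=100, p=0.5)
```

With p = 0.5 the estimate is exactly twice the one obtained from the same events with p = 1, matching the 1/p correction described above.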

Reproducibility
Computation of explanatory variables for the experiments presented in this paper has been done with the open-source software eventnet 3 and parameter estimation with the survival package 4 (Therneau and Grambsch 2013) in the R software for statistical computing. The preprocessed list of 360 million relational events which constitutes our input data is publicly available at https://doi.org/10.5281/zenodo.1626323. The eventnet configuration and an R script to fit models are also available online. 5 This makes it easy to reproduce experiments such as the ones presented in this paper, potentially varying sample parameters, model statistics, the model estimation procedure, or the empirical data used to fit the model.

Conclusion
Relational event networks naturally result from computer-mediated communication and collaboration or automated data collection strategies that record social interaction. Networks obtained in this way are often large. Relational event models provide an appropriate statistical framework to model event networks without having to aggregate dyadic events over time intervals. Yet, studies that apply REM to large event networks are surprisingly rare.
Previous work (Vu et al. 2015) proposed case-control sampling to reduce the size of the risk set, which is often quadratic in the number of nodes. In this paper we systematically assess the variation in parameter estimates obtained by combining case-control sampling with uniform sampling from the observed events.
Results from this paper convey an encouraging message. It seems possible to reliably fit relational event models to networks with millions of nodes and hundreds of millions of events by analyzing a sample of some tens of thousands of events and a small number of controls per event. Some network effects, such as degree effects, seem to require much smaller sample sizes, comprising only a few hundred events, to be reliably estimated. We identified characteristics of the data that might cause relatively high variability in the estimation of some effects and recommend that analysts check the distributions of explanatory variables, separately for sampled events and controls, to recognize such situations.
Future work might experimentally assess empirical properties of different sampling strategies, including stratified sampling (Vu et al. 2015), which might be a way to handle very skewed variables. In addition to case-control sampling and sampling of the observed events used for fitting the model, one might also consider sampling from the input events that define the network of past events, see Sect. 3.3. Such an approach might make it possible to analyze even larger event networks, especially if the unrestricted network of past events is too large to be kept in computer memory. In this paper we maintained the network of past events as a function of all input events without sampling, since this ensures that explanatory variables are computed correctly. Sampling from the input events would require conducting sensitivity experiments similar to the ones presented in this paper. (We note that the scenario of using only the sampled events to build up the network of past events is equivalent to settings with missing data, given the assumption that dyadic observations are missing completely at random. In our situation, where we have the full data available and sample only to reduce computational runtime, the assumption of "missing completely at random" is guaranteed by design.) Last but not least, computation times of statistics could potentially be reduced by improvements in graph algorithms. For instance, in our model the computation of the four-cycle statistic has an asymptotic runtime that is orders of magnitude larger than the runtime needed to compute the other statistics. Algorithms for more efficiently maintaining counts of four-cycles in dynamic networks exist (Eppstein et al. 2012) but do not match the usage scenario of this paper, where we have many updates and need four-cycle counts only for a much smaller number of dyads.
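To see why the four-cycle statistic dominates runtime, consider a naive computation for a single dyad (u, v) in the bipartite user-article graph: it must intersect the article set of u with that of every co-editor of v. The sketch below uses a hypothetical formulation of the statistic (not the eventnet implementation) to make this cost visible.

```python
def four_cycles(edges, u, v):
    """Count four-cycle configurations supporting dyad (u, v) in a
    bipartite user->article event graph: pairs (w, x) with w != u and
    x != v such that u edited x, w edited x, and w edited v.

    A naive sketch; its cost grows with the degrees of u and v, which
    is why this statistic dominates runtime on large networks.
    """
    out = {}
    for (s, r) in edges:
        out.setdefault(s, set()).add(r)
    editors_of_v = [w for w, arts in out.items() if v in arts and w != u]
    u_arts = out.get(u, set()) - {v}
    return sum(len(u_arts & (out[w] - {v})) for w in editors_of_v)

edges = [(1, "a"), (2, "a"), (2, "b"), (3, "a"), (3, "c")]
n = four_cycles(edges, 1, "b")   # user 2 closes a four-cycle via article "a"
```

The incremental algorithms cited above avoid recomputing these intersections after every update, but, as noted, they maintain counts for all dyads rather than only for the sampled ones.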
Conflicts of interest: the authors have nothing to disclose.