1. Introduction
Gene expression is the process by which genes in different organisms are activated to produce proteins when they are needed to carry out their function. Gene expression data contain key information about intracellular biological processes (Kothapalli et al., 2002). Gene coexpression datasets typically describe the expression level of different genes across different samples, often taken under different experimental conditions. Such data are frequently represented as gene coexpression networks, with nodes representing genes and edges representing correlations in expression between pairs of genes across multiple samples (Lee et al., 2004). Representing gene coexpression as networks helps in the study and visualization of the expression data and the exploitation of the structure of interactions between genes at a whole-system level (Weirauch, 2011; Magwene & Kim, 2004). One motivation behind creating these networks is that genes which are highly coexpressed across multiple samples are likely to have related functions (Hughes et al., 2000; Stuart et al., 2003; van Noort et al., 2003; Makrodimitris et al., 2020), allowing inference of gene function using guilt-by-association approaches (Wolfe et al., 2005). This procedure is especially useful if the studied organism is poorly annotated. However, the lack of reliable genomic information can hinder the validation of the accuracy of the network models generated from gene expression data. Noise in the data makes it difficult to distinguish genes that are expressed at a low level from those not expressed.
Therefore, there is a need for network construction pipelines that are robust to experimental error, yet capture most of the information in the data.
Multiple strategies to construct gene coexpression networks from gene expression data are available (Bar-Joseph et al., 2003; Ihmels et al., 2002; Ucar et al., 2007; Langfelder & Horvath, 2008; George et al., 2019). A recent approach uses signed distance correlation (Pardo-Diaz et al., 2021). This approach identifies coexpression relationships between genes and produces robust networks that capture more biological information than those obtained using alternative metrics such as Pearson correlation, Spearman correlation, and mutual information (Pardo-Diaz et al., 2021). Signed distance correlation is based on distance correlation (Székely et al., 2007), a measure that evaluates the association between the patterns of change across the samples and provides a non-negative score that is zero if and only if the expression vectors are statistically independent. Because it is unsigned, distance correlation does not permit the differentiation of genes with the same expression pattern from genes expressed at opposite times. Signed distance correlation overcomes this problem by combining the distance correlation value with a sign that indicates the direction of the association. The sign corresponds to the sign of the Pearson correlation between the expression of the two genes across the samples in the dataset. Figure 1 illustrates how signed distance correlation is obtained.
To our knowledge, signed distance correlation has been used in this field only to generate unweighted networks, using the R package COGENT (Bozhilova et al., 2020). COGENT aids the selection of a robust network construction method without the need for any external validation data. The main functions in this package iteratively split the gene expression data into two different sets, construct a network from each of them and then evaluate their similarity; the higher the similarity of the networks, the more robust the construction method. The functions implemented in COGENT to measure the similarity between the two generated networks are suitable to analyze only unweighted networks and complete weighted networks.
COGENT assists the selection of the optimal threshold value so that only pairs of genes for which the correlation value of their expression exceeds the threshold are connected in the network. The use of a threshold to construct networks results in a relatively sparse unweighted network. However, unweighted networks obtained through thresholding weighted edges may ignore important information about the strength of the correlation. Avoiding the use of a threshold results in complete weighted networks in which the weights of the edges are proportional to the correlation values. In this type of network, it may be difficult to distinguish the signal from the noise. Thus, there is a need for a method that eases the separation of signal and noise while keeping information about the strength of the correlation.
In this paper, we present a method to construct weighted and thresholded gene coexpression networks in which the sparsity can be controlled by assigning weighted edges only to those pairs of genes with an expression correlation higher than a given threshold. The weights of the edges correspond to the correlation values. To select the threshold value, we have extended COGENT, including a comparison methodology that allows us to evaluate the similarity between weighted and thresholded networks.
Using this extension, we build and compare networks from gene expression data using Pearson or signed distance correlation. For ease of comparison and breadth, we analyze the three datasets presented in Pardo-Diaz et al. (2021), as they not only correspond to different organisms (bacteria, yeast and human) but are also derived from three different experimental techniques: microarrays, RNA-Seq and single-cell RNA-Seq. When comparing the networks, we evaluate the robustness measurement from COGENT and the amount of biological information they capture according to STRING, a protein–protein interaction database with scores for pairs of proteins (Szklarczyk et al., 2019). The higher the STRING score for a protein pair, the more likely the pair is to have a biologically meaningful functional relationship. Using STRING, we show that for our datasets, networks constructed using signed distance correlation capture more biological information and are structurally more stable than networks based on Pearson correlation.
We also compare the resulting weighted and thresholded networks based on signed distance correlation to the unweighted networks obtained using the same metric as presented in Pardo-Diaz et al. (2021). In this comparison, the weighted and thresholded networks capture more biological information than the corresponding unweighted networks, according to STRING.
While we apply our method to gene expression data, our method to construct networks from signed distance correlations (in combination with COGENT) can be used in applications beyond gene expression and beyond bioinformatics.
Data and source code are available from https://github.com/javierpardodiaz/sdcorGCN and http://opig.stats.ox.ac.uk/resources.
2. Methodology
2.1 Datasets and preprocessing
We analyze the three datasets employed in Pardo-Diaz et al. (2021):

RL3841: A collection of 54 microarrays measuring the expression of 7,077 genes of the bacterium Rhizobium leguminosarum bv. viciae 3841.

Yeast: A dataset obtained using RNA-Seq of yeast (Saccharomyces cerevisiae) expressing pathways designed to increase ATP or GTP consumption. We use all the raw counts for experiment E-MTAB-5174 in Expression Atlas (Petryszak et al., 2016) and remove the genes with zero expression variance. The final dataset, which we feed into our pipeline, includes the expression of 6,930 genes across 209 samples.

Human liver: A dataset obtained using single-cell RNA-Seq of human liver cells (Klimm et al., 2020). The original dataset measures the expression of 15,353 genes in 1,622 cells.
The three datasets in the form of expression matrices (genes in rows and samples in columns) are publicly available online. We denote the expression matrices by M.
As detailed in Pardo-Diaz et al. (2021), for the RL3841 and yeast datasets, we apply quantile normalization (Bolstad et al., 2003) to the gene expression matrix M. This normalization enables us to compare data from different experiments. To avoid interference from low expression values in the quantile normalization, we ignore the 20% least expressed genes from each sample before the normalization step. After the quantile normalization, we set the ignored values to the lowest expression value in M to decrease the level of noise.
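The core of the quantile normalization step can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the paper: the function name `quantile_normalize` is hypothetical, ties are broken by row order rather than averaged, and the masking of the 20% least expressed genes per sample is left out.

```python
def quantile_normalize(matrix):
    """Quantile-normalise the columns (samples) of a genes x samples matrix.

    `matrix` is a list of rows (genes); every row has one value per sample.
    Simplification: tied values are ranked by row order, not averaged.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # Sort the values of each sample independently.
    sorted_cols = [sorted(row[j] for row in matrix) for j in range(n_cols)]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / n_cols
                 for k in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        # Row indices of sample j ordered from lowest to highest expression.
        order = sorted(range(n_rows), key=lambda i: matrix[i][j])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result


# Toy matrix: 4 genes (rows) measured in 3 samples (columns).
toy = [[5, 4, 3],
       [2, 1, 4],
       [3, 4, 6],
       [4, 2, 8]]
normalized = quantile_normalize(toy)
```

After this step every sample has the same empirical distribution of values, which is what makes measurements from different experiments comparable.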
For the human liver dataset, as described in Pardo-Diaz et al. (2021), we follow a different approach because of differences in both the data and the organism. The gene expression levels in this dataset correspond to the expression of the genes in different cells rather than in different samples, and while both R. leguminosarum and S. cerevisiae are unicellular organisms, humans are not. Hence, a considerable proportion of genes are not expressed in the studied cells. For this dataset, we quantile-normalise the data (Bolstad et al., 2003) to make the measurements in the different cells comparable. Afterwards, as in Prieto et al. (2008), we identify the non-changing genes. These are the genes for which the difference between their highest and lowest expression value (“expression difference”) is lower than the median of all the expression differences calculated for each gene, and for which the mean expression signal between samples is lower than the median of all the expression signals calculated for each gene. After removing the non-changing genes, we obtain an already quantile-normalised dataset with information for 8,585 genes.
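The non-changing gene filter described above combines two median cut-offs. The sketch below illustrates that rule on a toy matrix; the function name `changing_gene_indices` and the data are hypothetical, and strict inequalities are assumed for both criteria.

```python
from statistics import mean, median


def changing_gene_indices(matrix):
    """Return the row indices of genes that are NOT flagged as non-changing.

    A gene is flagged when BOTH of the following hold:
      * its expression difference (max - min) is below the median difference;
      * its mean expression is below the median of the per-gene means.
    """
    differences = [max(row) - min(row) for row in matrix]
    means = [mean(row) for row in matrix]
    difference_cutoff = median(differences)
    mean_cutoff = median(means)
    return [i for i in range(len(matrix))
            if not (differences[i] < difference_cutoff
                    and means[i] < mean_cutoff)]


# Toy data: 4 genes (rows) measured in 3 cells (columns).
toy = [[0, 0, 1],     # small difference, low mean: removed
       [0, 10, 20],   # large difference: kept
       [5, 5, 6],     # small difference but high mean: kept
       [1, 2, 9]]     # large difference: kept
kept = changing_gene_indices(toy)
```

Requiring both conditions keeps genes that are stably but highly expressed, and removes only those that are both flat and weakly expressed.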
In all three datasets, we denote the preprocessed gene expression matrix by $M^*$ . This matrix has n rows (genes) and m columns (samples).
2.2 Correlation matrices
From each preprocessed gene expression matrix $M^*$ , we compute two correlation matrices S and P, both of them with dimensions $n \times n$ . The matrix S contains the signed distance correlation values for the expression of each pair of genes, whereas the matrix P includes the pairwise Pearson correlation values. The matrix S is the result of assigning the sign of the values in the matrix P to the distance correlation values (Székely et al., 2007) between the expression of the genes. Assigning a sign to the (unsigned) distance correlation values allows us to differentiate positive and negative correlation values. The pipeline followed to construct the expression matrices is described in Pardo-Diaz et al. (2021) and depicted in Figure 1.
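For a single pair of genes, the entry of S can be illustrated as below. This is a sketch in plain Python using the sample distance correlation of Székely et al. (2007), computed from double-centred pairwise distance matrices; it is not the implementation used in the paper. The function names are hypothetical, and treating a Pearson correlation of exactly zero as positive is an arbitrary assumption made here.

```python
import math


def _centered_distances(x):
    """Double-centred matrix of pairwise absolute differences of x."""
    n = len(x)
    d = [[abs(x[i] - x[j]) for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in d]  # equal to column means (symmetry)
    grand_mean = sum(row_means) / n
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation, a value in [0, 1]."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    n = len(x)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(v * v for row in a for v in row) / n ** 2
    dvar_y = sum(v * v for row in b for v in row) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:  # at least one vector is constant
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / denom)


def pearson(x, y):
    """Sample Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    denom = math.sqrt(vx * vy)
    return cov / denom if denom > 0 else 0.0


def signed_distance_correlation(x, y):
    """Distance correlation carrying the sign of the Pearson correlation."""
    sign = -1.0 if pearson(x, y) < 0 else 1.0  # assumption: zero counts as +
    return sign * distance_correlation(x, y)


gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [10.0, 8.0, 6.0, 4.0, 2.0]   # decreases as gene_a increases
print(signed_distance_correlation(gene_a, gene_b))
```

The quadratic cost in the number of samples per gene pair is acceptable for a sketch but would need an optimised routine for matrices with thousands of genes.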
2.3 Weighted and thresholded network construction
In each of our three datasets, we use the gene correlation matrices S and P to obtain weighted and thresholded networks. These networks contain edges only between those pairs of genes that show a correlation of their expression greater than a given threshold. Unlike in unweighted networks, we assign a different weight to each edge. The weight of the edge corresponds to the value of the correlation of the expression between the two genes. The weighted and thresholded networks obtained using the signed distance correlation matrix S with threshold $\phi$ have an adjacency matrix $B_S(\phi)$ :
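Written out from the description above, the thresholding rule takes the following form. This is a reconstruction from the surrounding text rather than the typeset Equation (1): the strict inequality follows the statement in the Discussion that correlations less than or equal to the threshold receive a weight of zero, and the exclusion of the diagonal ($i \neq j$) is an assumption.

```latex
B_S(\phi)_{i,j} =
\begin{cases}
  S_{i,j} & \text{if } S_{i,j} > \phi \text{ and } i \neq j, \\
  0       & \text{otherwise.}
\end{cases}
```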
We construct the weighted and thresholded networks from the Pearson correlation matrix P in a similar way.
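As an illustration, the same thresholding function can be applied to either correlation matrix, and the resulting network can be summarised by its number of edges, sum of edge weights and average edge weight. The sketch below uses hypothetical function names and a made-up $3 \times 3$ matrix, and assumes that self-loops are excluded.

```python
def threshold_network(corr, phi):
    """Weighted adjacency matrix keeping only correlations above phi."""
    n = len(corr)
    return [[corr[i][j] if i != j and corr[i][j] > phi else 0.0
             for j in range(n)] for i in range(n)]


def network_summary(adj):
    """Edge count, sum of edge weights and mean edge weight (upper triangle)."""
    n = len(adj)
    weights = [adj[i][j] for i in range(n) for j in range(i + 1, n)
               if adj[i][j] != 0.0]
    total = sum(weights)
    return {"edges": len(weights),
            "weight_sum": total,
            "mean_weight": total / len(weights) if weights else 0.0}


# Made-up correlation matrix for three genes.
S = [[1.0, 0.8, -0.6],
     [0.8, 1.0, 0.3],
     [-0.6, 0.3, 1.0]]
B = threshold_network(S, phi=0.5)
summary = network_summary(B)
```

Because the threshold is positive, negative correlations never produce edges, so all weights in the resulting network are positive.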
We use an extension of COGENT (Bozhilova et al., 2020) to find a threshold value $\phi^*$ that results in a network with optimal self-consistency. The main idea behind COGENT is that the more similar two networks obtained using the same network construction method and from different overlapping subsets of the same dataset are, the more self-consistent the employed method is: despite the changes in the dataset, the topology of the resulting networks is similar. Heuristically, the networks obtained using a self-consistent construction method are self-consistent themselves since small changes in the datasets will not affect their structure. Our contribution to COGENT relies on the addition of a method that allows assessing the similarity between weighted and thresholded networks (Equation (2)).
As in Pardo-Diaz et al. (2021), we run 25 COGENT iterations in which the samples (columns) in M are grouped into two overlapping sets $M_1$ and $M_2$ . Both sets share half of the total number of samples, and differ in 1/3 from each other. At each iteration, we generate 25 pairs of correlation matrices $S_1$ and $S_2$ and test different threshold values. For each iteration and threshold value, the correlation matrices $S_1$ and $S_2$ are thresholded (Equation (1)) and turned into networks $H_1$ and $H_2$ . The processes taking place in each of the COGENT iterations to obtain the networks $H_1$ and $H_2$ are depicted in Figure 2. To calculate a similarity between the two networks $H_1$ and $H_2$ , we use an adjusted weighted Jaccard index that permits the comparison between networks with different edge densities:
where
$R_1$ and $R_2$ are randomizations of the adjacency matrices $H_1$ and $H_2$ , respectively, and n is the number of nodes in the network. To generate the randomizations, we first permute the rows in the correlation matrices and then shuffle the columns using the same permutation. The networks $R_1$ and $R_2$ have the same topological properties as $H_1$ and $H_2$ and allow us to estimate the expected similarity between the analysed networks and random networks with the same edge weight distribution. Without this correction, the networks derived from low threshold values would have the highest similarity values.
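The two ingredients of this comparison can be sketched as follows. The code shows the standard weighted Jaccard index (sum of element-wise minima over sum of element-wise maxima) and a randomization that applies one permutation to both rows and columns; it does not reproduce the adjustment of Equation (2). The function names are hypothetical, and drawing independent permutations for $R_1$ and $R_2$ is an assumption.

```python
import random


def weighted_jaccard(a, b):
    """Standard weighted Jaccard index of two weighted adjacency matrices."""
    n = len(a)
    numerator = denominator = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # each undirected edge counted once
            numerator += min(a[i][j], b[i][j])
            denominator += max(a[i][j], b[i][j])
    return numerator / denominator if denominator > 0 else 0.0


def permute_nodes(adj, rng):
    """Apply the same random permutation to the rows and the columns."""
    n = len(adj)
    perm = list(range(n))
    rng.shuffle(perm)
    return [[adj[perm[i]][perm[j]] for j in range(n)] for i in range(n)]


H1 = [[0.0, 0.8, 0.0],
      [0.8, 0.0, 0.6],
      [0.0, 0.6, 0.0]]
H2 = [[0.0, 0.7, 0.5],
      [0.7, 0.0, 0.0],
      [0.5, 0.0, 0.0]]

rng = random.Random(0)
observed = weighted_jaccard(H1, H2)
R1 = permute_nodes(H1, rng)  # same edge weights, node labels shuffled
R2 = permute_nodes(H2, rng)
baseline = weighted_jaccard(R1, R2)  # similarity expected without shared structure
```

Permuting rows and columns together relabels the nodes, so each randomized network keeps the edge weight distribution of the original while losing the correspondence between genes.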
The similarity score associated with each threshold $s(\phi)$ is the average of the similarity of 25 pairs of networks (Equation (2)). Then, to favor signal over noise, we prioritise those networks with a low sum of edge weights and obtain the Score($\phi$) value associated with each threshold $\phi$ :
where n denotes the number of nodes in the network (i.e. the number of genes). After computing the scores for all tested thresholds, we select the threshold value $\phi^*$ that results in the highest score Score($\phi^*$). We denote the sum of edge weights of the optimal network for signed distance correlation by $w_S$ , and the optimal weighted and thresholded network itself by $WS(w_S)$ .
Following the same pipeline, we select the unique threshold value $\phi^\star$ that results in the optimal network $WP(w_P)$ (sum of edge weights = $w_P$ ) obtained using the Pearson correlation matrix P.
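Two mechanical parts of this section, the overlapping split of the samples and the choice of the highest-scoring threshold, can be sketched as below. The split reflects one reading of the description: half of the samples are shared and the remainder is divided between the two sets, so that about one third of each set is unique to it. The function names are hypothetical, and `score_fn` is a placeholder for the COGENT-based score, which is not reproduced here.

```python
import random


def split_samples(n_samples, rng):
    """Two overlapping sets of column indices sharing half of all samples."""
    columns = list(range(n_samples))
    rng.shuffle(columns)
    half = n_samples // 2
    shared, rest = columns[:half], columns[half:]
    mid = len(rest) // 2
    set_1 = sorted(shared + rest[:mid])
    set_2 = sorted(shared + rest[mid:])
    return set_1, set_2


def select_threshold(thresholds, score_fn):
    """Return the threshold with the highest score (first one on ties)."""
    return max(thresholds, key=score_fn)


rng = random.Random(1)
cols_1, cols_2 = split_samples(8, rng)

# Placeholder score with a single peak at 0.6, for illustration only.
best = select_threshold([0.5, 0.6, 0.7], lambda phi: -(phi - 0.6) ** 2)
```

With 8 samples, each set holds 6 columns, 4 of which are shared, so 2 of the 6 columns in each set (one third) are unique to it.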
2.4 STRING evaluation of weighted and thresholded networks
We assess the amount of biological information contained in the newly generated weighted and thresholded networks $WS(w_S)$ and $WP(w_P)$ using STRING, a database of known and predicted protein–protein interactions (Szklarczyk et al., 2019). STRING collects information from numerous sources, including experimental data, computational predictions, and text-mining. The association evidence in STRING is categorized into independent channels, weighted, and integrated to produce a confidence score C for all recorded protein interactions. Interactions with a high C score are more likely to be true than those with a low score. To evaluate the networks, we employ the same set of confidence scores as in Pardo-Diaz et al. (2021):

C: Total scores provided by STRING

$C^\dagger$ : Scores that only consider coexpression information

$C^\ddagger$ : Scores that exclude coexpression information.
We expect the overlap of our networks with the scores in $C^\dagger$ to be higher than the overlap with the scores in $C^\ddagger$ since our input datasets contain gene expression information. Nevertheless, both $C^\dagger$ and $C^\ddagger$ are interesting to analyse: the first one might be an indicator of how well the networks capture coexpression relationships whereas the latter can be used as an indicator of how well coexpression can predict other types of relationships.
For each network and confidence score set, we compute the dot product of the edge weights and the STRING confidence score. Afterwards, we divide the result by the sum of the weights of all edges in the network. Equation (5) indicates how we obtain the STRING score for a weighted network with adjacency matrix $B_S(\phi)$ and confidence score set C:
where n denotes the number of nodes in the network, $B_S(\phi)_{i,j}$ is the entry for genes i and j in $B_S(\phi)$ , and $C_{i,j}$ refers to the confidence score for the genes i and j in the confidence score set C.
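As described in words above, the score is a weighted average of confidence scores, with the edge weights acting as the averaging weights: $\sum_{i<j} B_S(\phi)_{i,j} C_{i,j} / \sum_{i<j} B_S(\phi)_{i,j}$. Summing over $i<j$ is an assumption; for symmetric matrices with a zero diagonal it gives the same ratio as summing over all ordered pairs. A minimal sketch with a hypothetical function name and made-up confidence values:

```python
def string_score(adj, confidence):
    """Edge-weight-weighted average of the confidence scores."""
    n = len(adj)
    dot = weight_sum = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # each undirected edge counted once
            dot += adj[i][j] * confidence[i][j]
            weight_sum += adj[i][j]
    return dot / weight_sum if weight_sum > 0 else 0.0


# Made-up network and confidence scores for three genes.
B = [[0.0, 0.8, 0.0],
     [0.8, 0.0, 0.6],
     [0.0, 0.6, 0.0]]
C = [[0.0, 900.0, 100.0],
     [900.0, 0.0, 400.0],
     [100.0, 400.0, 0.0]]
score = string_score(B, C)
```

Pairs without an edge contribute nothing to either sum, so the confidence score of 100 for genes 0 and 2 does not affect the result.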
Despite being normalised by the sum of edge weights, the score might be biased towards networks with fewer and heavier edges. For this reason, and in order to keep the comparison between the networks obtained using Pearson and distance correlation as fair as possible, we construct and evaluate six additional networks:

$WS(w_P)$ : Network from S with the same sum of weights of edges as $WP(w_P)$

$WS(e_P)$ : Network from S with the same number of edges as $WP(w_P)$

$WS(a_P)$ : Network from S with the same average edge weight as $WP(w_P)$

$WP(w_S)$ : Network from P with the same sum of weights of edges as $WS(w_S)$

$WP(e_S)$ : Network from P with the same number of edges as $WS(w_S)$

$WP(a_S)$ : Network from P with the same average edge weight as $WS(w_S)$ .
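How the matching networks are obtained is not detailed in this section. One simple possibility, shown here purely as an assumption, is to rank the positive correlations and keep the strongest edges until the target number of edges or the target sum of edge weights is reached. The function names and the toy matrix are hypothetical.

```python
def strongest_edges(corr, k=None):
    """Positive off-diagonal correlations as (weight, i, j), strongest first."""
    n = len(corr)
    edges = [(corr[i][j], i, j) for i in range(n) for j in range(i + 1, n)
             if corr[i][j] > 0]
    edges.sort(reverse=True)
    return edges[:k]  # k=None keeps every positive edge


def edges_matching_weight_sum(corr, target):
    """Add edges by decreasing weight while the total moves closer to target."""
    chosen, total = [], 0.0
    for weight, i, j in strongest_edges(corr):
        if abs(total + weight - target) > abs(total - target):
            break  # adding this edge would overshoot further than stopping
        chosen.append((weight, i, j))
        total += weight
    return chosen


# Toy correlation matrix for four genes.
corr = [[1.0, 0.9, 0.7, 0.2],
        [0.9, 1.0, 0.5, -0.4],
        [0.7, 0.5, 1.0, 0.6],
        [0.2, -0.4, 0.6, 1.0]]
same_edge_count = strongest_edges(corr, k=2)
same_weight_sum = edges_matching_weight_sum(corr, target=2.0)
```

Because edge weights are discrete, the sum of weights can usually be matched only approximately; here the closest achievable total to 2.0 is 2.2.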
We also generate two sets of 30 random networks to compare the results obtained in the STRING evaluation with those expected by chance. The sets of random networks have the same number of edges and distribution of edge weights as $WS(w_S)$ and $WP(w_P)$ , respectively. We generate the networks by assigning the edges of the original networks to randomly picked pairs of vertices as a Bernoulli graph with a fixed number of edges and weights, and compute their STRING scores.
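The random networks described above can be sketched by reassigning every edge weight of an existing network to a randomly chosen pair of distinct nodes. The function name is hypothetical and the seed is arbitrary.

```python
import random


def random_network(adj, rng):
    """Random network with the same number of edges and the same edge weights."""
    n = len(adj)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    weights = [adj[i][j] for i, j in pairs if adj[i][j] != 0.0]
    # Draw as many distinct node pairs as there are edges, without replacement.
    chosen = rng.sample(pairs, len(weights))
    randomized = [[0.0] * n for _ in range(n)]
    for (i, j), weight in zip(chosen, weights):
        randomized[i][j] = randomized[j][i] = weight
    return randomized


original = [[0.0, 0.8, 0.0, 0.0],
            [0.8, 0.0, 0.6, 0.0],
            [0.0, 0.6, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0]]
rng = random.Random(42)
null_networks = [random_network(original, rng) for _ in range(30)]
```

Scoring each of the 30 networks against STRING and averaging gives an estimate of the score expected by chance for a network of that density and weight distribution.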
2.5 Comparison of weighted and thresholded networks and unweighted networks
Next, we assess whether the weighted and thresholded gene coexpression networks we obtain capture more biological information than unweighted coexpression networks. We compare our networks with the optimal unweighted signed distance correlation networks from Pardo-Diaz et al. (2021), denoted by $NS(d_S)$ . We analyse and compare four networks:

$WS(w_S)$ : Optimal weighted network from S

$NS({w_S})$ : Unweighted network with the same edges as $WS(w_S)$

$NS(d_S)$ : Optimal unweighted network from S

$WS({d_S})$ : Weighted network with the same edges as $NS(d_S)$ .
We focus only on signed distance correlation networks because unweighted and weighted signed distance correlation networks are both more self-consistent and capture more biological information than their matching Pearson networks (see Results section).
Unweighted networks, by definition, do not have weights associated with their edges; in order to be able to compare the networks using the scoring function from Equation (5), we assign to edges of the networks $NS({w_S})$ and $NS(d_S)$ as weight the average edge weight of the networks $WS(w_S)$ and $WS({d_S})$ , respectively. All the edges in an unweighted network thus have the same weight.
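This weight assignment can be illustrated with a short sketch (hypothetical function name, made-up network). Every edge of the weighted network is replaced by the average edge weight, which preserves both the edge set and the sum of edge weights.

```python
def uniform_weight_network(adj):
    """Replace every edge weight by the average edge weight of the network."""
    n = len(adj)
    weights = [adj[i][j] for i in range(n) for j in range(i + 1, n)
               if adj[i][j] != 0.0]
    average = sum(weights) / len(weights) if weights else 0.0
    return [[average if adj[i][j] != 0.0 else 0.0 for j in range(n)]
            for i in range(n)]


weighted = [[0.0, 0.8, 0.0],
            [0.8, 0.0, 0.6],
            [0.0, 0.6, 0.0]]
unweighted = uniform_weight_network(weighted)  # both edges now weigh about 0.7
```

With uniform weights, the weighted average of confidence scores reduces to the plain mean of the confidence scores over the edges, so the comparison isolates the effect of the weights themselves.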
3. Results
3.1 Network construction and COGENT evaluation
We evaluate how the score retrieved using COGENT changes across different thresholds in each dataset and for each correlation matrix. This score depends on the similarity of the networks constructed at each COGENT iteration. The similarity is adjusted using the overlap expected between each of the networks and a network with the same edge weights distributed randomly, and then adjusted to prioritise sparser networks (see Equations (2) and (4)). Figure 3 illustrates how the signed distance and Pearson scores change for different edge weight values in the three datasets. In all the datasets (RL3841, yeast, and human liver), the curve shows a similar shape and there is an edge weight value for which the scores reach their maxima. The thresholds associated with those edge weights are the optimal thresholds $\phi^*$ and $\phi^\star$ which we choose when constructing the networks $WS(w_S)$ and $WP(w_P)$ . The score, threshold, and summaries of these networks are shown in Table 1. In all cases, the highest scores obtained using signed distance correlation are higher than those for Pearson correlation. Therefore, weighted and thresholded gene coexpression networks based on signed distance correlation can be more self-consistent than those based on Pearson correlation.
3.2 STRING evaluation
We assess the amount of biological information contained in the networks following the methodology described in the Methods section. Tables 2–4 present the networks constructed using the different datasets, their metrics, and their STRING scores. The optimal signed distance correlation networks $WS(w_S)$ retrieve a higher score than all alternatives based on Pearson correlation for all the studied datasets. The optimal Pearson correlation networks $WP(w_P)$ retrieve a lower score than most of their (signed distance) competitors. The only exception to this trend is the yeast network $WS(a_P)$ (signed distance correlation network with the average edge weight of the optimal Pearson correlation network) which returns a lower score than the network $WP(w_P)$ (optimal Pearson correlation network). These results suggest that using signed distance correlation for generating weighted and thresholded networks from gene expression data can yield better results than using Pearson correlation. Figure 4 shows the STRING scores obtained by the networks with the sum of edge weights $w_S$ and $w_P$ for the three datasets and the different sets of evidence.
We observe that for both correlation matrices, a higher average edge weight (and a lower number of edges) implies a higher STRING score. Figure 5 shows the high positive correlation between the STRING score using the information set C and the average edge weight of the networks for the RL3841 dataset. Still, often signed distance correlation networks retrieve a higher STRING score than Pearson networks even when their average edge weight is lower. For example, for the RL3841 dataset, the only case in which a Pearson correlation network retrieves a higher STRING score than a signed distance correlation network is when comparing the network $WS(a_P)$ (signed distance correlation network with the average edge weight of the optimal Pearson correlation network, with score 24.78) versus the network $WP(a_S)$ (Pearson correlation network with the average edge weight of the optimal signed distance correlation network, with score 24.86). However, this comparison lacks interest since their average edge weights (0.6789 and 0.6451) are quite different.
The random networks obtain a lower STRING score than Pearson and signed distance correlation networks (see Tables 2–4 and Figure 3). We find the highest difference between the random expected STRING scores and those retrieved by the networks when using only coexpression information ($C^\dagger$). The scores obtained by the networks $WS(w_S)$ and $WP(w_P)$ are 9.63 and 6.89 times higher than the score expected by chance in the case of the RL3841 dataset. The smallest differences correspond to the exclusion of the coexpression information ($C^\ddagger$), where the values are 3.62 ($WS(w_S)$) and 2.84 ($WP(w_P)$) times higher than the randomly expected values for the RL3841 dataset.
3.3 Comparison with unweighted networks
Next, we compare the amount of biological information captured by weighted and thresholded networks, and unweighted networks, constructed using signed distance correlation. We compare the two groups of networks with the same sets of edges as described in the Methods section: the optimal set of edges for the weighted and thresholded network and the optimal set for the unweighted network in Pardo-Diaz et al. (2021). The weights in the unweighted networks are set to the average weight in the matching weighted and thresholded networks. The results for the STRING evaluation for the different datasets are shown in Table 5. For the three datasets, the weighted and thresholded networks can capture more of the biological information and should therefore be preferred. For the human liver dataset, we present only a set of two networks since the optimal threshold for the weighted and the unweighted networks is the same and therefore they have the same edges.
4. Discussion and conclusions
The method we propose combines the intuitiveness of the unweighted networks, as there are edges connecting only those pairs of genes with a high correlation in their expression, with the fine-tuning provided by assigning different weight values to different edges. This combination allows differentiating genes which are highly coexpressed from genes that, even if they are coexpressed, are not highly associated. These characteristics can help to study and analyze the networks.
We select the optimal threshold value to generate the weighted and thresholded networks using an extension of COGENT (Bozhilova et al., 2020) that allows assessing the similarity of this type of network. For each correlation matrix, we select the threshold value that results in a high self-consistency and prioritizes signal over noise. We also normalized the correlation values that are higher than the threshold to lie between zero and one, using these normalized values as edge weights, as follows: edges connecting genes with an expression correlation less than or equal to the threshold had a weight of zero; edges connecting genes with an expression correlation equal to one had a weight of one; edges connecting genes with an expression correlation between the threshold and one had weights between zero and one. We evaluated the self-consistency of the obtained network as described in Section 2.3. We observed that this normalization step resulted in a similar self-consistency. For the signed distance correlation network, the optimal threshold was 0.52, resulting in a score of 0.4582. This value is lower than the one obtained without the normalization: 0.4584. For the Pearson correlation networks, we also obtained higher values when not using the normalization. In light of these results, we do not include this normalization step in our pipeline.
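The normalization tested here maps the threshold to zero and a correlation of one to one. The text does not state the functional form in between; the sketch below assumes a linear rescaling, and the function name is hypothetical.

```python
def rescale_weight(correlation, phi):
    """Map correlations in (phi, 1] linearly onto (0, 1]; others become 0.

    Assumes phi < 1. The linear form is an assumption made for illustration.
    """
    if correlation <= phi:
        return 0.0
    return (correlation - phi) / (1.0 - phi)


phi = 0.52  # threshold reported above for signed distance correlation
weights = [rescale_weight(c, phi) for c in (0.30, 0.52, 0.76, 1.00)]
```

A rescaling of this kind spreads the retained correlations over the whole unit interval, which changes the relative contribution of weak and strong edges to any weight-based score.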
The correlation matrices S and P differentiate positive from negative correlations. However, the threshold value we select to construct the networks is always greater than zero and therefore the weights of the edges in the network are always positive. As discussed in Pardo-Diaz et al. (2021), we construct networks that include only positive correlations since negative correlations do not always imply a functional relationship. Nevertheless, a signed network with weights associated with its edges might include valuable information since the sign of the weights allows positive and negative associations to be differentiated. Exploring a modification of our network construction pipeline of this type might provide an improvement.
For the three studied datasets, the use of signed distance correlation to generate weighted and thresholded networks results in a higher self-consistency than the use of Pearson correlation. Regarding the STRING evaluation, in most cases the signed distance correlation networks also obtain a higher score than their competitors constructed using Pearson correlation. The only exception to this trend is when comparing networks with the same average edge weight in the yeast dataset. However, overall, the presented results suggest that the networks constructed using signed distance correlation capture more biological information.
As shown in Table 5, our weighted and thresholded networks capture more biological information than unweighted networks. We extract the same conclusion from the analysis of the three datasets. This result is in line with what we expect since the use of edge weights representing the strength of the correlation between the expression of the pairs of genes results in an increase in the amount of information contained in the network.
The threshold values that we select to generate the weighted and thresholded networks from the three datasets are very similar to those that we use to construct the unweighted networks. This fact suggests that independently of the construction methodology we follow, the barrier between signal and noise is in the same range of correlation values for a given correlation matrix.
We use our method to construct a novel weighted and thresholded gene coexpression network for R. leguminosarum. This network promises to reveal rich biological information and it is, therefore, the starting point for further investigations of the biological mechanisms of this organism. In particular, we plan to identify groups of genes in R. leguminosarum which are highly connected in the network, and associate them with specific biological processes. To do so, we will make use of community detection techniques and new experimental data. For the human liver dataset, one could similarly validate communities by exploring the prediction of disease-related biological information; see for example Song et al. (2019), Chen et al. (2018), Li et al. (2019).
Finally, we have showcased our method on gene expression datasets from different organisms obtained using different techniques: microarrays (R. leguminosarum), RNA-Seq (yeast), and single-cell RNA-Seq (human). However, the methods that we have developed are general, and can also be used to construct networks in a vast range of domains, such as, for example, economics (Wang et al., 2018), neuroscience (Bernhardt et al., 2011), climatology (Donges et al., 2009), and indeed any discipline where networks are constructed from correlation data.
Acknowledgments
We thank Alison K East and Florian Klimm for their help and advice on how to analyze the datasets used in this manuscript. We thank Katharine Turner (ANU), Lyuba V. Bozhilova, Florian Klimm, and Malte D Luecken for fruitful discussions. In addition, we acknowledge support from COST Action CA15109, Keble College Oxford and the Keble Association, as well as the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work (http://dx.doi.org/10.5281/zenodo.22558).
Funding
This work is supported by the Engineering and Physical Sciences Research Council (EPSRC) [EP/R512333/1 to JPD, MBD, PSP, CMD, and GR, EP/T018445/1 to GR], the Biotechnology and Biological Sciences Research Council (BBSRC) [BB/T001801/1 to PSP and GR], the COSTNET COST Action [CA15109 to GR], and e-Therapeutics plc. MBD acknowledges support from the Oxford-Emirates Data Science Lab.
Competing interests
None.