While there are cases where it is straightforward and unambiguous to define a network given data, often a researcher must make choices in how they define the network, and those choices, preceding most of the work on analyzing the network, have outsized consequences for that subsequent analysis. Sitting between gathering the data and studying the network is the upstream task: how to define the network from the underlying or original data. Defining the network precedes all subsequent or downstream tasks, tasks we will focus on in later chapters. Often those tasks are the focus of network scientists, who take the network as a given and concentrate their efforts on methods using those data. Envision the upstream task by asking two questions: what are the nodes? And what are the links? The network follows from those definitions. You will find these questions a useful guiding star as you work, and you can gain new insights by reevaluating their answers from time to time.
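As a brief illustration (with invented email records, not an example from the text, and assuming the networkx library), two different answers to "what are the links?" turn the same raw data into two different networks:

```python
# A minimal sketch with invented email records: the same raw data yield
# different networks depending on how we answer "what are the links?"
from collections import Counter
import networkx as nx

records = [("ana", "bo"), ("ana", "bo"), ("ana", "carl"), ("bo", "carl")]

# Definition 1: link any pair that exchanged at least one message.
G_any = nx.Graph(records)

# Definition 2: link only pairs that exchanged more than one message.
counts = Counter(tuple(sorted(pair)) for pair in records)
G_repeat = nx.Graph([pair for pair, c in counts.items() if c > 1])

print(G_any.number_of_edges())     # 3 edges
print(G_repeat.number_of_edges())  # 1 edge
```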
Networks exhibit many common patterns. What causes them? Why are they present? Are they universal across all networks or only certain kinds of networks? One way to address these questions is with models. In this chapter, we explore in depth the classic mechanistic models of network science. Random graph models underpin much of our understanding of network phenomena, from small-world path lengths to heterogeneous degree distributions and clustering. Mathematical tools help us understand what mechanisms or minimal ingredients may explain such phenomena, from basic heuristic treatments to combinatorial tools such as generating functions.
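For instance, a minimal sketch (our own toy example, using networkx) shows an Erdős–Rényi random graph reproducing short, small-world path lengths while failing to produce appreciable clustering:

```python
# Toy example: an Erdos-Renyi random graph with n nodes and mean degree <k>.
import math
import networkx as nx

n, avg_k = 1000, 6
G = nx.gnp_random_graph(n, avg_k / (n - 1), seed=42)

# Heuristic expectation: path lengths grow like ln(n) / ln(<k>).
print(math.log(n) / math.log(avg_k))           # ~3.86

# Measure on the giant component in case the graph is disconnected.
giant = G.subgraph(max(nx.connected_components(G), key=len))
print(nx.average_shortest_path_length(giant))  # close to the estimate
print(nx.average_clustering(G))                # ~ <k>/n, nearly zero
```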
Network science is a broadly interdisciplinary field, pulling from computer science, mathematics, statistics, and more. The data scientist working with networks thus needs a broad base of knowledge, as network data call for—and are analyzed with—many computational and mathematical tools. One needs a good working knowledge of programming, including data structures and algorithms, to analyze networks effectively. In addition to graph theory, probability theory is the foundation for any statistical modeling and data analysis. Linear algebra provides another foundation for network analysis and modeling because matrices are often the most natural way to represent graphs. Although this book assumes that readers are familiar with the basics of these topics, here we review the computational and mathematical concepts and notation that will be used throughout the book. You can use this chapter as a starting point for catching up on the basics, or as a reference while delving into the book.
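To make the linear-algebra connection concrete, here is a small refresher sketch (a toy graph of our own) relating a graph's adjacency matrix to node degrees and walk counts:

```python
# The adjacency matrix A of an undirected simple graph: row sums of A
# recover the node degrees, and powers of A count walks.
import networkx as nx
import numpy as np

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])
A = nx.to_numpy_array(G)    # A[i, j] = 1 if nodes i and j are linked
degrees = A.sum(axis=1)     # row sums give degrees: [2, 2, 3, 1]
print(degrees)

# (A @ A)[i, j] counts walks of length two between nodes i and j.
print((A @ A)[0, 1])        # one such walk: 0 -> 2 -> 1
```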
As we have seen, network data are necessarily imperfect. Missing and spurious nodes and edges can create uncertainty in what the observed data tell us about the original network. In this chapter, we dive deeper into tools that allow us to quantify such effects and probe more deeply into the nature of an unseen network from our observations. The fundamental challenge of measurement error in network data is capturing the error-producing mechanism accurately and then inferring the unseen network from the (imperfectly) observed data. Computational approaches can give us clues and insights, as can mathematical models. Mathematical models can also build up methods of statistical inference, whether in estimating parameters describing a model of the network or estimating the network's structure itself. But such methods quickly become intractable without taking on some possibly severe assumptions, such as edge independence. Yet, even without addressing the full problem of network inference, in this chapter, we show valuable ways to explore features of the unseen network, such as its size, using the available data.
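As one hedged illustration of estimating an unseen network's size, a classic capture–recapture calculation (not necessarily the method used in the chapter) combines two independent samples of observed nodes:

```python
# Lincoln-Petersen capture-recapture estimate of total node count,
# given two hypothetical, independently observed node samples.
sample_1 = {"a", "b", "c", "d", "e"}
sample_2 = {"c", "d", "e", "f", "g", "h"}

overlap = len(sample_1 & sample_2)   # nodes seen in both samples
estimated_size = len(sample_1) * len(sample_2) / overlap
print(estimated_size)                # 5 * 6 / 3 = 10 nodes
```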
In this chapter, we focus on statistics and measures that quantify a network's structure and characterize how it is organized. These measures have been central to much of network science, and a vast array of material is available to us, spanning all scales of the network. The measures we discuss include general-purpose measures and those specialized to particular circumstances, all of which help us get a better handle on the network data. Network science has generated a dizzying array of valuable measures over the years. For example, we can measure local structures, motifs, patterns of correlations within the network, clusters and communities, hierarchy, and more. These measures are used for exploratory and confirmatory analyses, which we discussed in the previous chapter. With the measures of this chapter, we can understand the patterns in our networks, and using statistical models, we can put those patterns on a firm foundation.
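For a flavor of these measures, here is a short sampler (using networkx and its built-in Zachary karate club graph as a stand-in for real data):

```python
# A few structural measures at different scales of a network.
import networkx as nx

G = nx.karate_club_graph()

print(nx.density(G))                           # global: fraction of possible links
print(nx.average_clustering(G))                # local: mean triangle density
print(nx.transitivity(G))                      # global clustering coefficient
print(nx.degree_assortativity_coefficient(G))  # degree-degree correlations
```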
Most scientists receive training in their domain of expertise but, with the possible exception of computer science, students of science receive little training in computer programming. While software engineering has brought forth sound principles for programming, training in software engineering translates only partially to scientific coding. Simply put, coding for science is not the same as coding for software. This chapter discusses best practices for writing correct, clear, and concise scientific code. We aim to ensure code is readable to others and supports, rather than hinders, data provenance. We also want the code to be a lasting record of the work performed, helping research reproducibility. Practices we cover to address these concerns include clear variable names and code comments, favoring simple code, carefully documenting program dependencies and inputs, and using version control and logging. Together, these practices will make your code work better and more reliably for you and your collaborators.
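As a small invented illustration of two of these practices, clear variable names and informative comments, compare an opaque function to a self-documenting one:

```python
# Opaque: what are x, f, and 2.2?
def f(x):
    return x * 2.2

# Clear: descriptive names and a comment recording where the constant
# comes from make the code self-documenting.
POUNDS_PER_KILOGRAM = 2.2  # conversion factor, rounded

def kilograms_to_pounds(mass_kg):
    """Convert a mass in kilograms to pounds."""
    return mass_kg * POUNDS_PER_KILOGRAM
```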
Network science has exploded in popularity since the late 1990s. But it flows from a long and rich tradition of mathematical and scientific understanding of complex systems. We can no longer imagine the world without evoking networks. And network data are at the heart of it. In this chapter, we set the stage by highlighting network science's ancestry and the exciting scientific approaches that networks have enabled, followed by a tour of the basic concepts and properties of networks.
Much of the power of networks lies in their flexibility. Networks can successfully describe many different kinds of complex systems. These descriptions are useful in part because they allow us to organize data associated with the system in meaningful ways. These associated attributes and their connections to the network are often the key drivers behind new insights. For example, in a social network, these may be demographic features, such as the ages and occupations of members of a firm. In a protein interaction network, gene ontology terms may be gathered by biologists studying the human genome. We can gain insight by collecting data on those features and associating them with the network nodes or links. In this chapter, we study ways to associate data with the network elements, the nodes and links. We describe ways to gather and store these attributes, the analyses we can perform using them, and the most crucial questions to ask about these attributes and their interplay with our networks.
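For example, a quick sketch (with invented people and attributes, using networkx) of attaching attributes to nodes and links and then using them in analysis:

```python
# Attach attributes to nodes and links, then query them.
import networkx as nx

G = nx.Graph()
G.add_node("ana", age=34, occupation="engineer")
G.add_node("bo", age=29, occupation="analyst")
G.add_edge("ana", "bo", weight=5)   # e.g., number of interactions

# Attributes travel with the network and can drive analysis:
older = [n for n, data in G.nodes(data=True) if data["age"] > 30]
print(older)                            # ['ana']
print(G.edges["ana", "bo"]["weight"])   # 5
```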
Many tools exist to help scientists work computationally. In addition to general-purpose and domain-specific programming languages, a wide assortment of programs exists to accomplish specific tasks. We call attention to a number of tools in this chapter, focusing on good practices when using them, both computationally and scientifically. Important computing tools for data scientists include computational notebooks, data pipelines and file transfer tools, UNIX-style operating systems, version control systems, and data backup systems. Of course, the world of computing moves fast, and tools are always coming and going, so we conclude with advice and a brief workflow to guide you through evaluating new tools.
Scientists must be ethical and conscientious, always. Data bring with them much promise to improve our understanding of the world around us, and to improve our lives within it. But there are risks as well. Scientists must understand the potential harms of their work and follow norms and standards of conduct to mitigate those concerns. But network data are different. As we discuss in this chapter, network data are some of the most important but also most sensitive data. Before we dive into the data, we discuss the ethics of data science in general and of network data in particular. The ethical issues that we face often do not have clear solutions but require thoughtful approaches and an understanding of complex contexts and difficult circumstances.
In this chapter, we discuss how to represent network data inside a computer, with some examples of computational tasks and the data structures that enable those computations. When working with network data using code, you have many choices of data structures, but which ones are best for your given goals? Writing your own code to process network data can be valuable, yet existing libraries, which feature extensively tested and efficiently engineered functionality, are worth considering as well. Python and R, both excellent programming languages for data science, come well equipped with third-party libraries for working with network data, and we describe some examples. We also discuss choosing and using typical file formats for storing network data, as many standard formats exist.
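As a minimal sketch of one common hand-rolled choice, an adjacency list stored as a dictionary mapping each node to its set of neighbors supports constant-time edge lookups:

```python
# Build an adjacency list (dict of sets) from an edge list.
edges = [("a", "b"), ("a", "c"), ("b", "c")]

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)   # undirected: store both ways

print(adjacency["a"])          # {'b', 'c'}
print("c" in adjacency["b"])   # True: O(1) edge lookup with sets
```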
Network data, like all data, are imperfect measures of the objects of study. There may be missing information or false information. For networks, these measurement errors can lead to missing nodes or links (network elements that exist in reality but are absent from the network data) or spurious nodes or links (nodes or links present in the data but absent in reality). More troubling, these conditions exist on a continuum, and there is a spectrum of scenarios where nodes or links may exist but not be meaningful in some way. In this chapter, we describe how such errors can appear in and affect network data, and we introduce some ways to handle such errors in the data processing steps. Fixes for errors can themselves alter the data, leading to different networks before and after processing, so we must be careful and circumspect in identifying and planning for such errors.
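One hedged way to probe such effects, sketched below with networkx and an arbitrary 10% removal rate, is to simulate missing links by deleting random edges and watching how a summary statistic responds:

```python
# Simulate missing links and compare a statistic before and after.
import random
import networkx as nx

random.seed(1)
G = nx.karate_club_graph()
removed = random.sample(list(G.edges()), k=int(0.1 * G.number_of_edges()))

G_observed = G.copy()
G_observed.remove_edges_from(removed)

print(nx.average_clustering(G))           # "true" network
print(nx.average_clustering(G_observed))  # observed, error-laden network
```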
What are the nodes? What are the links? These questions are not the start of your work—the upstream task makes sure of that—but they are an inflection point. Keep them front of mind. Your methods, the paths you take to analyze and interrogate your data, all unfold from the answers (plural!) to these questions. This chapter reflects on where we have gone, where we can go for more, and, perhaps, what the future has in store for data science, networks, and network data.
Machine learning has revolutionized many fields, including science, healthcare, and business. It is also widely used in network data analysis. This chapter provides an overview of machine learning methods and how they can be applied to network data. Machine learning can be used to clean, process, and analyze network data, as well as to make predictions about networks and network attributes. Methods that transform networks into meaningful representations are especially useful for specific network prediction tasks, such as classifying nodes and predicting links. The challenges of using machine learning with network data include recognizing data leakage and detecting dataset shift. As with all machine learning, effective use on network data depends on practicing good data hygiene when evaluating a predictive model's performance.
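As a small example of one prediction task named above, link prediction, the sketch below (using networkx's built-in karate club graph and Jaccard scoring, one of many possible scores) ranks candidate node pairs by neighborhood overlap:

```python
# Score candidate (non-)links by the Jaccard coefficient of the
# endpoints' neighborhoods; higher scores suggest more plausible links.
import networkx as nx

G = nx.karate_club_graph()

candidates = [(0, 9), (0, 15), (14, 15)]   # pairs not currently linked
for u, v, score in nx.jaccard_coefficient(G, candidates):
    print(f"({u}, {v}) -> {score:.3f}")
```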
Some networks, many in fact, vary with time. They may grow in size, gaining nodes and links. Or they may shrink, losing links and becoming sparser over time. Sitting behind many networks are drivers that change their structure, predictably or not, leading to dynamic networks that exhibit all manner of changes. This chapter focuses on describing and quantifying such dynamic networks, recognizing the challenges that dynamics bring, and finding ways to address those challenges. We show how to represent dynamic networks in different ways, how to devise null models for dynamic networks, and how to compare and contrast dynamical processes running on top of a network against a network structure that is itself dynamic. Dynamic network data also bring practical issues, and we discuss working with date and time data and file formats.
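One common representation mentioned above is a timestamped edge list; here is a brief sketch (with invented events) that slices such a list into yearly snapshots:

```python
# Slice a timestamped edge list into per-year network snapshots.
from datetime import datetime
import networkx as nx

events = [
    ("ana", "bo",   datetime(2020, 3, 1)),
    ("bo",  "carl", datetime(2020, 8, 15)),
    ("ana", "carl", datetime(2021, 1, 7)),
]

snapshots = {}
for u, v, t in events:
    snapshots.setdefault(t.year, nx.Graph()).add_edge(u, v, time=t)

for year, G in sorted(snapshots.items()):
    print(year, G.number_of_edges())   # 2020: 2 edges, 2021: 1 edge
```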
In this chapter, we explore several important statistical models. Statistical models allow us to perform statistical inference—the process of selecting models and making predictions about the underlying distributions—based on the data we have. Many approaches exist, from the stochastic block model and its generalizations to the edge observer model, the exponential random graph model, and the graphical LASSO. As we show in this chapter, such models help us understand our data, but using them may at times be challenging, either computationally or mathematically. For example, the model must often be specified with great care, lest it seize on a drastically unexpected network property or fall victim to degeneracy. Or the model must make implausibly strong assumptions, such as conditionally independent edges, leading us to question its applicability to our problem. Or our data may simply be too large for the inference method to handle efficiently. As we discuss, the search continues for better, more tractable statistical models and more efficient, more accurate inference algorithms for network data.
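To make one of these models concrete, here is an illustrative sketch of the generative side of the stochastic block model (block sizes and probabilities are arbitrary choices, sampled with networkx); inference, of course, runs in the other direction, recovering block structure from an observed graph:

```python
# Sample a graph from a two-block stochastic block model.
import networkx as nx

sizes = [50, 50]                 # two blocks of 50 nodes each
probs = [[0.20, 0.02],           # dense within blocks,
         [0.02, 0.20]]           # sparse between them
G = nx.stochastic_block_model(sizes, probs, seed=7)

print(G.number_of_nodes(), G.number_of_edges())
```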