In this chapter, we begin our dive into the fundamentals of network data. We delve into the strange world of networks by considering the friendship paradox, the apparently contradictory finding that most people (nodes) have friends (neighbors) who are more popular than themselves. How can this be? Where are all these friends coming from? We introduce network thinking to resolve this paradox. As we will see, it is due to constraints induced by the network structure: pick a node at random and you are much more likely to land next to a high-degree node than on one, because high-degree nodes have many neighbors. This is unexpected, almost profoundly so; a local (node-level) view of a network will not accurately reflect the global network structure. This paradox highlights the care we need to take when thinking about networks and network data, both mathematically and practically.
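The paradox is easy to verify computationally. Here is a minimal sketch, assuming Python with networkx (an illustration, not code from the chapter), that counts how many nodes in a scale-free network are less popular than their neighbors on average:

```python
import networkx as nx

# A scale-free network whose hubs make the paradox pronounced.
G = nx.barabasi_albert_graph(n=1000, m=3, seed=42)

# Count nodes whose degree is below the mean degree of their neighbors.
paradox_count = sum(
    G.degree(node) < sum(G.degree(nbr) for nbr in G.neighbors(node)) / G.degree(node)
    for node in G.nodes
)

print(f"{paradox_count / G.number_of_nodes():.0%} of nodes are less "
      "popular than their friends, on average")
```

In a heavy-tailed network like this one, the fraction is typically well above one half, precisely because a randomly chosen neighbor is biased toward hubs.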
Network studies follow an explicit form, from framing questions and gathering data to processing those data and drawing conclusions. Data processing leads to new questions, which lead to new data, and so forth: network studies follow a repeating lifecycle. Yet along the way, many choices confront the researcher, who must be mindful of the choices they make with their data and of the tools and techniques they use to study those data. In this chapter, we describe how studies of networks begin and proceed: the lifecycle of a network study.
In this chapter, we introduce visualization techniques for networks, the problems we face, and the solutions we use to make those visualizations as effective as possible. Visualization is an essential tool for exploring network data, revealing patterns that may not be easily inferred from statistics alone. Although network visualization can be done in many ways, the most common approach is through two-dimensional node-link diagrams. Properly laying out nodes and choosing the mapping between network and visual properties is essential to create an effective visualization, which requires iteration and fine-tuning. For dense networks, filtering or aggregating the data may be necessary. Following an iterative, back-and-forth workflow is essential, trying different layout methods and filtering steps to best show the network's structure while keeping the original questions and goals in mind. Visualization is not always the endpoint of a network analysis but can also be a useful step in the middle of an exploratory data analysis pipeline, similar to traditional statistical visualization of non-network data.
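As a concrete illustration of this workflow, here is a minimal sketch, assuming Python with networkx and matplotlib (hypothetical tool choices, not the chapter's prescribed ones), that lays out a small network with a force-directed method and maps degree onto node size:

```python
import networkx as nx
import matplotlib.pyplot as plt

G = nx.karate_club_graph()  # a classic small social network

# Force-directed (spring) layout; the seed makes the layout reproducible.
# In practice, layouts need iteration and tuning before they read well.
pos = nx.spring_layout(G, seed=7)

# Map a network property (degree) onto a visual property (node size).
sizes = [60 * G.degree(n) for n in G.nodes]
nx.draw_networkx(G, pos, node_size=sizes, with_labels=False,
                 node_color="steelblue", edge_color="gray")
plt.axis("off")
plt.show()
```

Trying a different layout function or filtering low-degree nodes before drawing is exactly the kind of back-and-forth iteration the chapter describes.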
This chapter covers data provenance or data lineage, the detailed history of how data was created and manipulated, as well as the process of ensuring the validity of such data by documenting the details of its origins and transformations. Data provenance is a central challenge when working with data. Computing helps but also hinders our ability to maintain records of our work with the data. The best science will result when we adopt strategies to carefully and consistently record and track the origin of data and any changes made along the way. For instance, we want to know where (and by whom) a dataset was created and what process was used to create it. Then, if there were any changes, such as fixing erroneous entries, we need a good record of those changes. With these goals in mind, we discuss best practices for tracking data provenance. While such practices take time and effort to implement, and may seem tedious in the short term, over time they will make your research more reliable, and you and your collaborators will be grateful.
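One lightweight way to implement such record-keeping, sketched here in Python with hypothetical file names (an illustration, not the chapter's prescribed method), is to log a content hash alongside a human-readable note every time a dataset is created or changed:

```python
import hashlib
import json
from datetime import datetime, timezone

def record_provenance(path, note, log_path="provenance.log"):
    """Append a timestamped entry tying a file's content hash to a note
    describing where it came from or what change was just made."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    entry = {
        "file": path,
        "sha256": digest,
        "note": note,
        "when": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")

# Hypothetical usage: log the raw download, then log each change.
# record_provenance("edges.csv", "downloaded from collaborator X, v2 export")
# record_provenance("edges.csv", "removed 12 duplicate rows; see clean_edges.py")
```

The hash lets anyone later verify that a file is exactly the version the log entry describes.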
All fields of science benefit from gathering and analyzing network data. This chapter summarizes a small portion of the ways networks appear across research fields, thanks to increasing volumes of data and the computing resources needed to work with that data. Epidemiology, dynamical systems, materials science, and many more fields than we can discuss here use networks and network data. We'll encounter many more examples during the rest of this book.
While there are cases where it is straightforward and unambiguous to define a network given data, often a researcher must make choices in how they define the network, and those choices, preceding most of the work on analyzing the network, have outsized consequences for that subsequent analysis. Sitting between gathering the data and studying the network is the upstream task: how to define the network from the underlying or original data. Defining the network precedes all subsequent or downstream tasks, tasks we will focus on in later chapters. Often those tasks are the focus of network scientists who take the network as a given and focus their efforts on methods using those data. Envision the upstream task by asking, "What are the nodes?" and "What are the links?", with the network following from those definitions. You will find these questions a useful guiding star as you work, and you can learn new insights by reevaluating their answers from time to time.
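To make the upstream task concrete, here is a minimal sketch in Python with networkx and hypothetical event data, where one defining choice, how many co-occurrences count as a link, directly changes the resulting network:

```python
import networkx as nx
from itertools import combinations
from collections import Counter

# Hypothetical raw data: people observed together at events.
events = [
    {"ana", "bo", "carla"},
    {"ana", "bo"},
    {"bo", "carla"},
    {"ana", "dev"},
]

# One upstream choice among many: nodes are people, and a link means
# two people co-attended at least `threshold` events.
co_counts = Counter()
for attendees in events:
    for u, v in combinations(sorted(attendees), 2):
        co_counts[(u, v)] += 1

threshold = 2  # changing this choice changes every downstream analysis
G = nx.Graph((u, v) for (u, v), c in co_counts.items() if c >= threshold)
print(G.edges())  # [('ana', 'bo'), ('bo', 'carla')]
```

Lower the threshold to 1 and dev joins the network; raise it to 3 and the network is empty. The downstream analysis inherits whichever choice was made here.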
Networks exhibit many common patterns. What causes them? Why are they present? Are they universal across all networks or only certain kinds of networks? One way to address these questions is with models. In this chapter, we explore in depth the classic mechanistic models of network science. Random graph models underpin much of our understanding of network phenomena, from small-world path lengths to heterogeneous degree distributions and clustering. Mathematical tools help us understand what mechanisms or minimal ingredients may explain such phenomena, from basic heuristic treatments to combinatorial tools such as generating functions.
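For example, the simplest random graph model, the Erdős–Rényi model G(n, p), already reproduces short, small-world path lengths. A minimal sketch, assuming Python with networkx (illustrative, not the chapter's derivations):

```python
import networkx as nx

n, p = 1000, 0.01
G = nx.gnp_random_graph(n, p, seed=1)

degrees = [d for _, d in G.degree()]
print("mean degree:", sum(degrees) / n)  # expected (n - 1) * p, about 10

# Short, small-world path lengths emerge even at this low density.
giant = G.subgraph(max(nx.connected_components(G), key=len))
print("average shortest path:", nx.average_shortest_path_length(giant))
```

With a thousand nodes, typical shortest paths here are only around three hops, the hallmark small-world effect that heuristic arguments and generating functions make precise.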
Network science is a broadly interdisciplinary field, pulling from computer science, mathematics, statistics, and more. The data scientist working with networks thus needs a broad base of knowledge, as network data calls for, and is analyzed with, many computational and mathematical tools. One needs a good working knowledge of programming, including data structures and algorithms, to analyze networks effectively. In addition to graph theory, probability theory is the foundation for any statistical modeling and data analysis. Linear algebra provides another foundation for network analysis and modeling because matrices are often the most natural way to represent graphs. Although this book assumes that readers are familiar with the basics of these topics, here we review the computational and mathematical concepts and notation that will be used throughout the book. You can use this chapter as a starting point for catching up on the basics, or as a reference while delving into the book.
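To illustrate why linear algebra is such a natural fit, here is a minimal sketch, assuming Python with networkx and NumPy (an illustration, not the chapter's notation), showing matrix operations on an adjacency matrix answering a network question:

```python
import networkx as nx
import numpy as np

G = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])

# The adjacency matrix A has A[i, j] = 1 when nodes i and j are linked.
A = nx.to_numpy_array(G)

# Matrix algebra does network work: (A @ A)[i, j] counts the length-2
# paths from i to j, and its diagonal recovers each node's degree.
print(np.diag(A @ A))  # [2. 2. 3. 1.]
```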
As we have seen, network data are necessarily imperfect. Missing and spurious nodes and edges can create uncertainty in what the observed data tell us about the original network. In this chapter, we dive deeper into tools that allow us to quantify such effects and probe more deeply into the nature of an unseen network from our observations. The fundamental challenge of measurement error in network data is capturing the error-producing mechanism accurately and then inferring the unseen network from the (imperfectly) observed data. Computational approaches can give us clues and insights, as can mathematical models. Mathematical models can also build up methods of statistical inference, whether in estimating parameters describing a model of the network or estimating the network's structure itself. But such methods quickly become intractable without taking on some possibly severe assumptions, such as edge independence. Yet, even without addressing the full problem of network inference, in this chapter we show valuable ways to explore features of the unseen network, such as its size, using the available data.
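As one example of estimating an unseen network's size, here is a minimal sketch of the classic capture–recapture (Lincoln–Petersen) estimator, an illustrative technique applied to hypothetical samples, not necessarily the chapter's exact method:

```python
import random

random.seed(0)  # for reproducibility

hidden_nodes = range(10_000)  # the unseen network's nodes; size unknown to us
sample1 = set(random.sample(hidden_nodes, 800))
sample2 = set(random.sample(hidden_nodes, 600))

# Lincoln-Petersen: the overlap between two independent samples
# tells us how large the full node set is likely to be.
overlap = len(sample1 & sample2)
estimate = len(sample1) * len(sample2) / overlap  # N_hat = n1 * n2 / m
print(f"estimated size: {estimate:.0f} (true size: 10000)")
```

Note the estimator assumes the two samples are independent and uniform, exactly the kind of possibly severe assumption the chapter warns about.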
In this chapter, we focus on statistics and measures that quantify a network's structure and characterize how it is organized. These measures have been central to much of network science, and a vast array of material is available to us, spanning all scales of the network. The measures we discuss include general-purpose measures and those specialized to particular circumstances, which help us get a better handle on the network data. Network science has generated a dizzying array of valuable measures over the years. For example, we can measure local structures, motifs, patterns of correlations within the network, clusters and communities, hierarchy, and more. These measures are used for exploratory and confirmatory analyses, which we discussed in the previous chapter. With the measures of this chapter, we can understand the patterns in our networks, and using statistical models, we can put those patterns on a firm foundation.
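For instance, assuming Python with networkx (an illustration, not the chapter's specific examples), a few lines compute measures at several scales:

```python
import networkx as nx

G = nx.karate_club_graph()

# A few measures spanning scales, from local structure to global patterns:
print("average clustering (local):", nx.average_clustering(G))
print("degree assortativity (correlations):",
      nx.degree_assortativity_coefficient(G))
print("transitivity (global):", nx.transitivity(G))
```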
Most scientists receive training in their domain of expertise but, with the possible exception of computer science, students of science receive little training in computer programming. While software engineering has brought forth sound principles for programming, training in software engineering only partially translates to scientific coding. Simply put, coding for science is not the same as coding for software. This chapter discusses best practices for writing correct, clear, and concise scientific code. We aim to ensure code is readable to others and supports, rather than hinders, data provenance. We also want the code to be a lasting record of the work performed, helping research reproducibility. Practices we cover to address these concerns include clear variable names and code comments, favoring simple code, carefully documenting program dependencies and inputs, and using version control and logging. Together, these practices will enable your code to work better and more reliably for yourself and your collaborators.
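A small sketch, with hypothetical names, of a few of these practices (descriptive naming, simple code, a docstring, and logging) in Python:

```python
import logging

logging.basicConfig(filename="analysis.log", level=logging.INFO)

# Opaque version, hard to read and impossible to audit later:
#   x = f(d, 0.5)

# Clear version: a descriptive name, a docstring, and logged parameters.
def filter_edges_by_weight(edge_weights, min_weight):
    """Keep only edges whose weight is at least min_weight."""
    logging.info("filtering edges with min_weight=%s", min_weight)
    return {edge: w for edge, w in edge_weights.items() if w >= min_weight}

strong_edges = filter_edges_by_weight({("a", "b"): 0.9, ("b", "c"): 0.2}, 0.5)
```

The log file doubles as a provenance record: months later it still shows which parameters produced which results.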
Network science has exploded in popularity since the late 1990s. But it flows from a long and rich tradition of mathematical and scientific understanding of complex systems. We can no longer imagine the world without invoking networks. And network data is at the heart of it. In this chapter, we set the stage by highlighting network science's ancestry and the exciting scientific approaches that networks have enabled, followed by a tour of the basic concepts and properties of networks.
Much of the power of networks lies in their flexibility. Networks can successfully describe many different kinds of complex systems. These descriptions are useful in part because they allow us to organize data associated with the system in meaningful ways. These associated attributes and their connections to the network are often the key drivers behind new insights. For example, in a social network, these may be demographic features, such as the ages and occupations of members of a firm. In a protein interaction network, gene ontology terms may be gathered by biologists studying the human genome. We can gain insight by collecting data on those features and associating them with the network nodes or links. In this chapter, we study ways to associate data with the network elements, the nodes and links. We describe ways to gather and store these attributes, what analyses we can do with them, and the most crucial questions to ask about these attributes and their interplay with our networks.
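For example, assuming Python with networkx and hypothetical demographic data (an illustration, not the chapter's dataset), attributes attach naturally to nodes and links and can then be analyzed alongside the structure:

```python
import networkx as nx

G = nx.Graph([("ana", "bo"), ("bo", "carla")])

# Attach attributes to nodes (hypothetical demographic data)...
nx.set_node_attributes(G, {"ana": 34, "bo": 28, "carla": 41}, name="age")
nx.set_node_attributes(
    G, {"ana": "engineer", "bo": "engineer", "carla": "teacher"},
    name="occupation")

# ...and to links.
G.edges["ana", "bo"]["since"] = 2015

# Attributes and structure analyzed together:
same = sum(G.nodes[u]["occupation"] == G.nodes[v]["occupation"]
           for u, v in G.edges)
print(f"{same} of {G.number_of_edges()} links join same-occupation people")
```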
Many tools exist to help scientists work computationally. In addition to general-purpose and domain-specific programming languages, a wide assortment of programs exists to accomplish specific tasks. We call attention to a number of tools in this chapter, focusing on good practices, both computational and scientific, when using them. Important computing tools for data scientists include computational notebooks, data pipelines and file transfer tools, UNIX-style operating systems, version control systems, and data backup systems. Of course, the world of computing moves fast, and tools are always coming and going, so we conclude with advice and a brief workflow to guide you through evaluating new tools.