Scientists must always be ethical and conscientious. Data hold much promise to improve our understanding of the world around us and our lives within it, but there are risks as well. Scientists must understand the potential harms of their work and follow norms and standards of conduct to mitigate them. Network data deserve particular care: as we discuss in this chapter, they are some of the most important but also most sensitive data. Before we dive into the data, we discuss the ethics of data science in general and of network data in particular. The ethical issues that we face often do not have clear solutions; they require thoughtful approaches and an understanding of complex contexts and difficult circumstances.
In this chapter, we discuss how to represent network data inside a computer, with some examples of computational tasks and the data structures that enable those computations. When working with network data using code, you have many choices of data structures, but which ones are best for your goals? Writing your own code to process network data can be valuable, yet existing libraries, which feature extensively tested and efficiently engineered functionality, are worth considering as well. Python and R, both excellent programming languages for data science, come well equipped with third-party libraries for working with network data, and we describe some examples. We also discuss choosing and using typical file formats for storing network data, as many standard formats exist.
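As a small illustration of the data-structure choice described above, the following sketch stores an undirected network as a dictionary of neighbor sets built from an edge list. The function name `build_adjacency` and the toy edge list are hypothetical, chosen only for this example; this is one common representation, not necessarily the one the chapter recommends.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency-set representation from (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)  # record v as a neighbor of u
        adjacency[v].add(u)  # and u as a neighbor of v (undirected link)
    return dict(adjacency)


# Hypothetical toy edge list, for illustration only.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
adjacency = build_adjacency(edges)

# With neighbor sets, a node's degree is just the size of its set.
degrees = {node: len(neighbors) for node, neighbors in adjacency.items()}
print(degrees)
```

Neighbor sets make "is u linked to v?" a constant-time lookup on average, whereas a plain edge list would require a scan; the trade-off is extra memory, since each link is stored twice.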
Network data, like all data, are imperfect measures of objects of study. There may be missing information or false information. For networks, these measurement errors can lead to missing nodes or links (network elements that exist in reality but are absent from the network data) or spurious nodes or links (nodes or links present in the data but absent in reality). More troubling is that these conditions exist in a continuum, and there is a spectrum of scenarios where nodes or links may exist but not be meaningful in some way. In this chapter, we describe how such errors can appear and affect network data and introduce some ways to handle such errors in the data processing steps. Fixes for errors can lead to different networks, before and after processing, for example, and we must be careful and circumspect in identifying and planning for such errors.
What are the nodes? What are the links? These questions are not the start of your work—the upstream task makes sure of that—but they are an inflection point. Keep them front of mind. Your methods, the paths you take to analyze and interrogate your data, all unfold from the answers (plural!) to these questions. This chapter reflects on where we have gone, where we can go for more, and, perhaps, what the future has in store for data science, networks and network data.
Machine learning has revolutionized many fields, including science, healthcare, and business. It is also widely used in network data analysis. This chapter provides an overview of machine learning methods and how they can be applied to network data. Machine learning can be used to clean, process, and analyze network data, as well as make predictions about networks and network attributes. Methods that transform networks into meaningful representations are especially useful for specific network prediction tasks, such as classifying nodes and predicting links. The challenges of using machine learning with network data include recognizing data leakage and detecting dataset shift. As with all machine learning, effective use of machine learning on networks depends on practicing good data hygiene when evaluating a predictive model’s performance.
Some networks, many in fact, vary with time. They may grow in size, gaining nodes and links. Or they may shrink, losing links and becoming sparser over time. Sitting behind many networks are drivers that change the structure, predictably or not, leading to dynamic networks that exhibit all manner of changes. This chapter focuses on describing and quantifying such dynamic networks, recognizing the challenges that dynamics bring, and finding ways to address those challenges. We show how to represent dynamic networks in different ways, how to devise null models for dynamic networks, and how to compare and contrast dynamical processes running on top of the network against a network structure that is itself dynamic. Dynamic network data also bring practical issues, and we discuss working with date and time data and file formats.
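One way to represent a dynamic network, sketched here under simple assumptions, is to aggregate timestamped contacts into a sequence of static snapshots. The function `snapshots`, the integer timestamps, and the window length are all hypothetical choices for illustration; other representations may suit other questions better.

```python
from collections import defaultdict


def snapshots(temporal_edges, window):
    """Group (u, v, t) contacts into windows of length `window`.

    Returns a dict mapping window index (t // window) to the set of
    undirected links observed in that window.
    """
    by_window = defaultdict(set)
    for u, v, t in temporal_edges:
        link = tuple(sorted((u, v)))  # (a, b) and (b, a) are the same link
        by_window[t // window].add(link)
    return dict(by_window)


# Hypothetical contact list: (node, node, integer timestamp).
contacts = [("a", "b", 0), ("b", "c", 3), ("a", "b", 4),
            ("c", "d", 11), ("b", "a", 12)]
for index, links in sorted(snapshots(contacts, window=5).items()):
    print(index, sorted(links))
```

Windows with no contacts simply do not appear in the result, and the choice of window length changes the snapshots, so it is worth treating as an analysis decision rather than a detail.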
In this chapter, we explore several important statistical models. Statistical models allow us to perform statistical inference—the process of selecting models and making predictions about the underlying distributions—based on the data we have. Many approaches exist, from the stochastic block model and its generalizations to the edge observer model, the exponential random graph model, and the graphical LASSO. As we show in this chapter, such models help us understand our data, but using them may at times be challenging, either computationally or mathematically. For example, the model must often be specified with great care, lest it seize on a drastically unexpected network property or fall victim to degeneracy. Or the model must make implausibly strong assumptions, such as conditionally independent edges, leading us to question its applicability to our problem. Or even our data may be too large for the inference method to handle efficiently. As we discuss, the search continues for better, more tractable statistical models and more efficient, more accurate inference algorithms for network data.
This chapter discusses the Fourier series representation for continuous-time signals. This is applicable to signals which are either periodic or have a finite duration. The connections between the continuous-time Fourier transform (CTFT), the discrete-time Fourier transform (DTFT), and Fourier series are also explained. Properties of Fourier series are discussed and many examples presented. For real-valued signals it is shown that the Fourier series can be written as a sum of a cosine series and a sine series; examples include rectified cosines, which have applications in electric power supplies. It is shown that the basis functions used in the Fourier series representation satisfy an orthogonality property. This makes the truncated version of the Fourier representation optimal in a certain sense. The so-called principal component approximation derived from the Fourier series is also discussed. A detailed discussion of the properties of musical signals in the light of Fourier series theory is presented, and leads to a discussion of musical scales, consonance, and dissonance. Also explained is the connection between Fourier series and the function-approximation property of multilayer neural networks, used widely in machine learning. An overview of wavelet representations and the contrast with Fourier series representations is also given.
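To make the idea of a truncated sine series concrete, here is a minimal sketch using a standard textbook example that is assumed for illustration and not taken from the chapter: the odd square wave x(t) = sign(sin t) with period 2π, whose Fourier series contains only odd sine harmonics with coefficients 4/(πk). The function name `square_wave_partial_sum` is hypothetical.

```python
import math


def square_wave_partial_sum(t, n_terms):
    """Truncated Fourier (sine) series of the square wave sign(sin t).

    Uses the first `n_terms` odd harmonics k = 1, 3, 5, ...
    with coefficients 4 / (pi * k).
    """
    total = 0.0
    for m in range(n_terms):
        k = 2 * m + 1  # odd harmonic index
        total += math.sin(k * t) / k
    return 4.0 / math.pi * total


# The square wave equals 1 at t = pi/2; the partial sums approach that value.
for n_terms in (1, 10, 100, 1000):
    print(n_terms, square_wave_partial_sum(math.pi / 2, n_terms))
```

Because the square wave is odd, its cosine series vanishes and only the sine series remains, which is a special case of the cosine-plus-sine decomposition for real-valued signals mentioned above.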
This chapter examines discrete-time LTI systems in detail. It shows that the input–output behavior of an LTI system is characterized by the so-called impulse response. The output is shown to be the so-called convolution of the input with the impulse response. It is then shown that exponentials are eigenfunctions of LTI systems. This property leads to the ideas of transfer functions and frequency responses for LTI systems. It is argued that the frequency response gives a systematic meaning to the term “filtering.” Image filtering is demonstrated with examples. The discrete-time Fourier transform (DTFT) is introduced to describe the frequency domain behavior of LTI systems, and allows one to represent a signal as a superposition of single-frequency signals (the Fourier representation). DTFT is discussed in detail, with many examples. The z-transform, which is of great importance in the study of LTI systems, is also introduced and its connection to the Fourier transform explained. Attention is also given to real signals and real filters, because of their additional properties in the frequency domain. Homogeneous time-invariant (HTI) systems are also introduced. Continuous-time counterparts of these topics are explained. B-splines, which arise as examples in continuous-time convolution, are presented.
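The statement that the output of an LTI system is the convolution of the input with the impulse response can be sketched directly for finite-length sequences. The function name `convolve` and the two-point averaging filter are assumptions made for this example only.

```python
def convolve(x, h):
    """Linear convolution y[n] = sum_k h[k] * x[n - k] of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk  # x[n] contributes to output sample n + k
    return y


# Hypothetical impulse response: a two-point moving average (a simple lowpass filter).
h = [0.5, 0.5]
x = [1.0, 2.0, 3.0, 4.0]
print(convolve(x, h))      # [0.5, 1.5, 2.5, 3.5, 2.0]
print(convolve([1.0], h))  # a unit impulse returns the impulse response itself
```

Feeding in a unit impulse and getting `h` back is the sense in which the impulse response characterizes the system.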
This chapter discusses many interesting properties of bandlimited signals. The subspace of bandlimited signals is introduced. It is shown that uniformly shifted versions of an appropriately chosen sinc function constitute an orthogonal basis for this subspace. It is also shown that the integral and the energy of a bandlimited signal can be obtained exactly from samples if the sampling rate is high enough. For non-bandlimited functions, such a result is only approximately true, with the approximation getting better as the sampling rate increases. A number of less obvious consequences of these results are also presented. In particular, well-known mathematical identities are derived using sampling theory; for example, the Madhava–Leibniz formula for the approximation of π can be derived in this way. When samples of a bandlimited signal are contaminated with noise, the reconstructed signal is also noisy. This noise depends on the reconstruction filter, which in general is not unique. Excess bandwidth in this filter increases the noise, and this is quantitatively analyzed. An interesting connection between bandlimited signals and analytic functions (entire functions) is then presented. This has many implications, one being that bandlimited signals are infinitely smooth.
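As an illustration of how the integral-from-samples property can yield the Madhava–Leibniz formula, here is one possible derivation. The choice of signal and of unit sample spacing are assumptions made for this sketch, and the chapter's own derivation may differ. The property used is that, for sample spacing T, the integral of x(t) equals T times the sum of the samples x(nT) whenever the Fourier transform X(jω) vanishes at ω = 2πk/T for every nonzero integer k.

```latex
% Assumed example signal: bandlimited to |\omega| < \pi/2, sampled at the integers (T = 1)
x(t) = \frac{\sin(\pi t/2)}{\pi t},
\qquad
X(j\omega) =
\begin{cases}
1, & |\omega| < \pi/2 \\
0, & |\omega| > \pi/2
\end{cases}

% Integral of the signal, read off from the transform at zero frequency
\int_{-\infty}^{\infty} x(t)\,dt = X(j0) = 1

% Samples at the integers
x(0) = \tfrac{1}{2},
\qquad
x(n) = \frac{\sin(\pi n/2)}{\pi n} =
\begin{cases}
0, & n \text{ even},\ n \neq 0 \\[4pt]
\dfrac{(-1)^{(|n|-1)/2}}{\pi |n|}, & n \text{ odd}
\end{cases}

% Equating the integral with the sum of the samples (x is even, so odd n pair up)
1 = \sum_{n=-\infty}^{\infty} x(n)
  = \frac{1}{2} + \frac{2}{\pi} \sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1}
\quad\Longrightarrow\quad
\sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1} = 1 - \frac{1}{3} + \frac{1}{5} - \cdots = \frac{\pi}{4}
```

The sampling condition holds here because X(jω) is zero at every nonzero multiple of 2π, so no aliased copy of the spectrum contributes at ω = 0.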
In working with network data, data acquisition is often the most basic yet the most important and challenging step. The availability of data and norms around data vary drastically across different areas and types of research. A team of biologists may spend more than a decade running assays to gather a cell's interactome; another team of biologists may only analyze publicly available data. A social scientist may spend years conducting surveys of underrepresented groups. A computational social scientist may examine the entire network of Facebook. An economist may comb through large financial documents to gather tables of data on stakes in corporate holdings. In this chapter, we move one step along the network study life-cycle. Key to data gathering is good record-keeping and data provenance. Good data gathering sets us up for future success (otherwise, garbage in, garbage out), making it critical to ensure that the best-quality and most appropriate data are used to power your investigation.
This chapter covers ways to explore your network data using visual means and basic summary statistics, and how to apply statistical models to validate aspects of the data. Data analysis can generally be divided into two main approaches, exploratory and confirmatory. Exploratory data analysis (EDA) is a pillar of statistics and data mining and we can leverage existing techniques when working with networks. However, we can also use specialized techniques for network data and uncover insights that general-purpose EDA tools, which neglect the network nature of our data, may miss. Confirmatory analysis, on the other hand, grounds the researcher with specific, preexisting hypotheses or theories, and then seeks to understand whether the given data either support or refute the preexisting knowledge. Thus, complementing EDA, we can define statistical models for properties of the network, such as the degree distribution, or for the network structure itself. Fitting and analyzing these models then recapitulates effectively all of statistical inference, including hypothesis testing and Bayesian inference.
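As a minimal sketch of the kind of summary statistic used in exploratory analysis, the following computes the empirical degree distribution of an undirected network from its edge list. The function name `degree_distribution` and the toy data are hypothetical, and the sketch assumes a simple network with no self-loops or repeated links.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {degree: number of nodes with that degree} for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1  # each link adds one to the degree of both endpoints
        degree[v] += 1
    return dict(Counter(degree.values()))


# Hypothetical toy edge list, for illustration only.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
distribution = degree_distribution(edges)
print(sorted(distribution.items()))

# Mean degree of the nodes that appear in the edge list.
n_nodes = sum(distribution.values())
mean_degree = sum(k * count for k, count in distribution.items()) / n_nodes
print(mean_degree)
```

Nodes with no links never appear in an edge list, so degree-zero nodes are missing from this count unless the node set is supplied separately.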
Realistic networks are rich in information, often too rich for all of that information to be easily conveyed. Summarizing the network then becomes useful, often necessary, for communication and understanding, with the caveat that a summary necessarily loses information about the network. Further, networks often do not exist in isolation. Multiple networks may arise from a given dataset, or multiple datasets may each give rise to different views of the same network. In such cases and more, researchers need tools and techniques to compare and contrast those networks. In this chapter, we'll show you how to summarize a network, using statistics, visualizations, and even other networks. From these summaries we then describe ways to compare networks, defining a distance between networks for example. Comparing multiple networks using the techniques we describe can help researchers choose the best data processing options and unearth intriguing similarities and differences between networks in diverse fields.
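One simple example of a distance between networks, assumed here for illustration and not necessarily one of the measures the chapter covers, is the Jaccard distance between the link sets of two undirected networks that share the same node labels. The function name `edge_jaccard_distance` is hypothetical.

```python
def edge_jaccard_distance(edges_a, edges_b):
    """Jaccard distance between two undirected link sets.

    0.0 means identical link sets; 1.0 means no links in common.
    """
    a = {frozenset(edge) for edge in edges_a}  # frozenset ignores link direction
    b = {frozenset(edge) for edge in edges_b}
    union = a | b
    if not union:
        return 0.0  # two empty networks are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical example: two processed versions of the same network.
g1 = [(1, 2), (2, 3), (3, 4)]
g2 = [(2, 1), (2, 3), (4, 5)]
print(edge_jaccard_distance(g1, g2))  # 0.5: two shared links out of four distinct links
```

This measure only makes sense when node labels are comparable across the two networks; comparing networks without aligned nodes calls for structure-based summaries instead.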
This chapter introduces the discrete Fourier transform (DFT), which is different from the discrete-time Fourier transform (DTFT) introduced earlier. The DFT transforms an N-point sequence x[n] in the time domain to an N-point sequence X[k] in the frequency domain by sampling the DTFT of x[n]. A matrix representation for this transformation is introduced, and the properties of the DFT matrix are studied. The fast Fourier transform (FFT), which is a fast algorithm to compute the DFT, is also introduced. The FFT makes the computation of the Fourier transforms of large sets of data practical. The digital signal processing revolution of the 1960s was possible because of the FFT. This chapter introduces the simplest form of FFT, called the radix-2 FFT, and a number of its properties. The chapter also introduces circular or cyclic convolution, which has a special place in DFT theory, and explains the connection to ordinary convolution. Circular convolution paves the way for fast algorithms for ordinary convolution, using the FFT. The chapter also summarizes the relationships between the four types of Fourier transform studied in this book: CTFT, DTFT, DFT, and Fourier series.
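The link between the DFT and circular convolution can be checked numerically with a short sketch. The code below evaluates the DFT directly from its definition, which takes on the order of N² operations; it is not an FFT. The function names `dft`, `idft`, and `circular_convolve` and the example sequences are assumptions made for illustration.

```python
import cmath


def dft(x):
    """N-point DFT from the definition: X[k] = sum_n x[n] * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


def idft(X):
    """Inverse DFT: x[n] = (1/N) * sum_k X[k] * exp(+j*2*pi*k*n/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]


def circular_convolve(x, h):
    """Circular convolution of two length-N sequences (indices taken modulo N)."""
    N = len(x)
    return [sum(x[m] * h[(n - m) % N] for m in range(N)) for n in range(N)]


# Hypothetical length-4 sequences.
x = [1, 2, 3, 4]
h = [1, 0, 0, 1]

direct = circular_convolve(x, h)
via_dft = idft([a * b for a, b in zip(dft(x), dft(h))])

print(direct)                              # [3, 5, 7, 5]
print([round(v.real, 9) for v in via_dft])  # same values, up to rounding error
```

Multiplying the two DFTs pointwise and inverting reproduces the circular convolution; replacing `dft` and `idft` with an FFT is what makes this route fast for long sequences.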