The benchmark measurement error model is the bivariate linear errors-in-variables (EIV) regression model with additive measurement errors in both the dependent variable and the regressor variable. The measurement errors are assumed to be classical, meaning that they are uncorrelated with the true value and have mean zero. Then the OLS estimator is inconsistent, with a bias toward zero. The measurement error is often large enough for this bias to be substantial; see, for example, Bound, Brown, and Mathiowetz (2001).
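To make the attenuation result concrete, the following simulation sketch generates data from the bivariate EIV model with classical measurement error in the regressor; the parameter values and variable names are illustrative choices, not taken from the text.

```python
# Simulation sketch of OLS attenuation bias under classical measurement
# error: plim(beta_hat) = beta * s2_x / (s2_x + s2_u), a shrinkage toward
# zero. All names and parameter values here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 2.0
s2_x, s2_u = 1.0, 0.5                            # var of true regressor, var of its error

x_star = rng.normal(0.0, np.sqrt(s2_x), n)       # true (unobserved) regressor
y = beta * x_star + rng.normal(0.0, 1.0, n)      # outcome with equation error
x = x_star + rng.normal(0.0, np.sqrt(s2_u), n)   # observed, error-ridden regressor

beta_ols = np.cov(x, y)[0, 1] / np.var(x)        # OLS slope of y on observed x
print(beta_ols)   # approx 2.0 * (1.0 / 1.5) = 1.33, biased toward zero
```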
For nonlinear models the attenuation result does not necessarily hold, but measurement error still leads to inconsistency because the identified parameter is not the parameter of interest in the model free of measurement error. The essential problem lies in the correlation between the observed regressor variable and the measurement error. This leads to loss of identification and distorted inferences about the role of the covariate. A key objective of analysis is to establish an identification strategy for the parameter of interest.
There are important differences between nonlinear and linear measurement error models. It is more difficult in nonlinear models to correct for classical measurement error in the regressors. Furthermore, in nonlinear models it may be more natural to allow measurement errors to be nonclassical and nonadditive. And although in linear models classical measurement error in the dependent variable is innocuous because it just contributes to the equation error as additive noise, in nonlinear models the presence of even classical measurement error in the dependent variable leads to loss of identification of model parameters.
This book describes regression methods for count data, where the response variable is a non-negative integer. The methods are relevant for analysis of counts that arise in both social and natural sciences.
Despite their relatively recent origin, count data regression methods build on an impressive body of statistical research on univariate discrete distributions. Many of these methods have now found their way into major statistical packages, which has encouraged their application in a variety of contexts. Such widespread use has itself thrown up numerous interesting research issues and themes, which we explore in this book.
The objective of the book is threefold. First, we wish to provide a synthesis and integrative survey of the literature on count data regressions, covering both the statistical and econometric strands. The former has emphasized the framework of generalized linear models, exponential families of distributions, and generalized estimating equations, while the latter has emphasized nonlinear regression and generalized method of moments frameworks. Yet between them there are numerous points of contact which can be fruitfully exploited. Our second objective is to make sophisticated methods of data analysis more accessible to practitioners with different interests and backgrounds. To this end we consider models and methods suitable for cross-section, time series, and longitudinal data. Detailed analyses of several data sets as well as shorter illustrations, implemented from a variety of viewpoints, are scattered throughout the book to put empirical flesh on theoretical or methodological discussion. We draw on examples from, and give references to, works in many applied areas.
The most commonly used models for count regression, Poisson and negative binomial, were presented in Chapter 3. In this chapter we introduce richer models for count regression using cross-section data. For some of these models the conditional mean retains the exponential functional form. Then the Poisson QMLE and NB2 ML estimators remain consistent, although they may be inefficient and, while adequate for predicting the conditional mean, may not be suitable for predicting probabilities. For many of these models, however, the Poisson and NB2 estimators are inconsistent. Then alternative methods are used, ones that generally rely heavily on parametric assumptions.
One reason for the failure of Poisson regression is unobserved heterogeneity, which contributes randomness beyond that of the Poisson process. This leads to mixture models, the negative binomial being only one example. A second reason is failure of the Poisson process assumption itself, and its replacement by a more general stochastic process.
Some common departures from the standard Poisson regression are as follows.
Failure of the mean-equals-variance restriction: Frequently the conditional variance of data exceeds the conditional mean, which is usually referred to as extra-Poisson variation or overdispersion relative to the Poisson model. Overdispersion may result from neglected or unobserved heterogeneity that is inadequately captured by the covariates in the conditional mean function. It is common to allow for random variation in the Poisson conditional mean by introducing a multiplicative error term. This leads to families of mixed Poisson models.
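A minimal simulation sketch of this multiplicative mixing, using a gamma heterogeneity term so that the resulting counts are negative binomial; the parameter values and names are illustrative.

```python
# Sketch of a mixed Poisson model: a multiplicative gamma heterogeneity
# term in the Poisson mean yields overdispersed counts (here, NB2).
# Names and parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, mu, alpha = 100_000, 3.0, 0.5          # conditional mean and heterogeneity variance

nu = rng.gamma(1 / alpha, alpha, size=n)  # gamma with E[nu] = 1, Var[nu] = alpha
y = rng.poisson(mu * nu)                  # counts with random mean mu * nu

print(y.mean(), y.var())   # mean ~ 3.0, variance ~ mu + alpha*mu^2 = 7.5 > 3.0
```

The variance exceeds the mean by alpha * mu^2, the quadratic (NB2) variance function, so the simulated counts display exactly the overdispersion described above.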
In this chapter we provide a detailed discussion of empirical models for three examples based on four cross-sectional data sets. The first example analyzes the demand for medical care by the elderly in the United States and shares many features of health utilization studies based on cross-section data. The second example is an analysis of recreational trips. The third is an analysis of completed fertility – the total number of children born to a woman with a complete history of births.
Figure 6.1 presents histograms for the four count variables studied; the first two histograms exclude the highest percentile for readability. Physician visits appear roughly negative binomial, with a mild excess of zeros. Recreational trips have a very large excess of zeros. Completed fertility in both data sets is bimodal, with modes at 0 and 2. Different count data models will most likely be needed for these different data sets.
The applications presented in this chapter emphasize fully parametric models for counts, an issue discussed in section 6.2. Sections 6.3 to 6.5 deal, in turn, with each of the three empirical applications. The health care example in section 6.3 is the most extensive and provides a lengthy treatment of model fitting, selection, and interpretation, with focus on a finite mixture model. The recreational trips example in section 6.4 pays particular attention to the special treatment of zero counts versus positive counts. The completed fertility illustration in section 6.5 is a nonregression example that emphasizes fitting a distribution that is bimodal. Section 6.6 pursues a methodological question concerning the distribution of the LR test under nonstandard conditions, previously raised in section 4.8.5.
Count regressions with endogenous regressors occur frequently. Ignoring the feedback from the response variable to the endogenous regressor, and simply conditioning the outcome on variables with which it is jointly determined, leads in general to inconsistent parameter estimates. The estimation procedure should instead allow for stochastic dependence between the response variable and endogenous regressors. In considering this issue the existing literature on simultaneous equation estimation in nonlinear models is of direct relevance (T. Amemiya, 1985).
The empirical example of Chapter 3 models doctor visits as depending in part on the individual's type of health insurance. In Chapter 3 the health insurance indicator variables were treated as exogenous, but health insurance is frequently a choice variable rather than exogenously assigned. A richer model is a simultaneous model with a count outcome depending on endogenous variable(s) that may be binary (two insurance plans), multinomial (more than two insurance plans), or simply continuous.
This chapter deals with several classes of models with endogenous regressors, tailored to the outcome of interest being a count. It discusses estimation and inference for both fully parametric full-information methods and less parametric limited-information methods. These approaches are based on a multiple equation model in which that for the count outcome is of central interest, but there is also an auxiliary model for the endogenous regressor, sometimes called the first-stage or reduced-form equation. Estimation methods differ according to the detail in which the reduced form is specified and exploited in estimation.
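One widely used limited-information method of this kind is the control-function, or two-stage residual inclusion, estimator. The sketch below is a minimal version assuming a single endogenous regressor d, an instrument z, and exogenous controls; all names are illustrative placeholders rather than the notation of this chapter.

```python
# Control-function (two-stage residual inclusion) sketch for a count
# outcome with one endogenous regressor; assumes a valid instrument z.
# Variable names (y, d, z, controls) are illustrative placeholders.
import numpy as np
import statsmodels.api as sm

def poisson_2sri(y, d, z, controls):
    """Stage 1: OLS of the endogenous regressor d on instrument z and controls.
    Stage 2: Poisson regression of y on d, controls, and the stage-1 residual,
    which proxies for the unobservable that drives the endogeneity."""
    X1 = sm.add_constant(np.column_stack([z, controls]))
    resid = sm.OLS(d, X1).fit().resid
    X2 = sm.add_constant(np.column_stack([d, controls, resid]))
    # Sandwich standard errors; note they ignore that the residual is itself
    # estimated, so in practice bootstrapping the two stages is often preferred.
    return sm.GLM(y, X2, family=sm.families.Poisson()).fit(cov_type="HC0")
```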
Longitudinal data or panel data are observations on a cross-section of individual units such as persons, households, firms, and regions that are observed over several time periods. The data structure is similar to that of multivariate data considered in Chapter 8. Analysis is simpler than for multivariate data because for each individual unit the same outcome variable is observed, rather than several different outcome variables. Yet analysis is also more complex because this same outcome variable is observed at different points in time, introducing time series data considerations presented in Chapter 7.
In this chapter we consider longitudinal data analysis if the dependent variable is a count variable. Remarkably, many count regression applications are to longitudinal data rather than simpler cross-section data. Econometrics examples include the number of patents awarded to each of many individual firms over several years, the number of accidents in each of several regions, and the number of days of absence for each of many persons over several years. A political science example is the number of protests in each of several different countries over many years. A biological and health science example is the number of occurrences of a specific health event, such as a seizure, for each of many patients in each of several time periods.
This chapter is intended to provide a self-contained treatment of basic cross-section count data regression analysis. It is analogous to a chapter in a standard statistics text that covers both homoskedastic and heteroskedastic linear regression models.
The most commonly used count models are Poisson and negative binomial. For readers interested only in these models, it is sufficient to read sections 3.1 to 3.5, along with preparatory material in sections 1.2 and 2.2.
As indicated in Chapter 2, the properties of an estimator vary with the assumptions made on the dgp. By correct specification of the conditional mean or variance or density, we mean that the functional form and explanatory variables in the specified conditional mean or variance or density are those of the dgp.
The simplest regression model for count data is the Poisson regression model. For the Poisson MLE, the following can be shown:
Consistency requires correct specification of the conditional mean. It does not require that the dependent variable y be Poisson distributed.
Valid statistical inference using default computed maximum likelihood standard errors and t statistics requires correct specification of both the conditional mean and variance. This requires equidispersion, that is, equality of the conditional variance and mean, but does not require that y be Poisson distributed.
Valid statistical inference using appropriately computed standard errors is still possible if data are not equidispersed, provided the conditional mean is correctly specified.
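In practice this means pairing the Poisson point estimates with robust (sandwich) standard errors. A minimal sketch using synthetic overdispersed data, with all names and parameter values illustrative:

```python
# Poisson regression with both default ML and robust (sandwich) standard
# errors; the latter remain valid under overdispersion provided the
# conditional mean is correctly specified. Data are synthetic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)
mu = np.exp(0.5 + 0.8 * x)                # exponential conditional mean
nu = rng.gamma(2.0, 0.5, size=n)          # heterogeneity => overdispersion
y = rng.poisson(mu * nu)

X = sm.add_constant(x)
fit_ml = sm.GLM(y, X, family=sm.families.Poisson()).fit()                 # default ML SEs
fit_rb = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC0")   # sandwich SEs
print(fit_ml.bse, fit_rb.bse)   # robust SEs are larger under overdispersion
```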
Bayesian methods provide a quite different way to view statistical inference and model selection and to incorporate prior information on model parameters. These methods have become increasingly popular over the past 20 years due to methodological advances, notably Markov chain Monte Carlo methods, and increased computational power.
Some applied studies in econometrics are fully Bayesian. Others merely use Bayesian methods as a tool to enable statistical inference in the classical frequentist maximum likelihood framework for likelihood-based models that are difficult to estimate using other methods such as simulated maximum likelihood.
Section 12.2 presents the basics of Bayesian analysis. Section 12.3 presents some results for Poisson models. Section 12.4 covers Markov chain Monte Carlo methods that are now the common way to implement Bayesian analysis when analytically tractable results cannot be obtained, and it provides an illustrative example. Section 12.5 summarizes Bayesian models for various types of count data. Section 12.6 concludes with a more complicated illustrative example, a count version of the Roy model that allows for endogenous selection.
BAYESIAN APPROACH
The Bayesian approach treats the parameters θ as unknown random variables, with inference on θ to be based both on the data y and on prior beliefs about θ. The data and prior beliefs are combined to form the posterior density of θ given y, and Bayesian inference is based on this posterior. This section presents a brief summary, with further details provided in subsequent sections.
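For intuition in the count setting, the simplest case is conjugate: an i.i.d. Poisson(theta) sample combined with a Gamma(a, b) prior yields a Gamma(a + sum(y), b + n) posterior. A minimal sketch, with illustrative data and prior values:

```python
# Conjugate Bayesian updating for a Poisson rate: a Gamma(a, b) prior plus
# an i.i.d. Poisson(theta) sample of size n with sum s gives a
# Gamma(a + s, b + n) posterior. Data and prior values are illustrative.
import numpy as np
from scipy import stats

y = np.array([2, 0, 3, 1, 4, 2, 2])     # observed counts (illustrative)
a, b = 1.0, 1.0                         # Gamma prior: shape a, rate b

a_post, b_post = a + y.sum(), b + len(y)
posterior = stats.gamma(a_post, scale=1.0 / b_post)   # scipy uses scale = 1/rate
print(posterior.mean(), posterior.interval(0.95))     # posterior mean, 95% interval
```

Markov chain Monte Carlo methods, discussed in section 12.4, take over precisely when such closed-form updates are unavailable, as in Poisson regression with covariates.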
This chapter presents the general modeling approaches most often used in count data analysis – likelihood-based, generalized linear models, and moment-based. Statistical inference for these nonlinear regression models is based on asymptotic theory, which is also summarized.
The models and results vary according to the strength of the distributional assumptions made. Likelihood-based models and the associated maximum likelihood estimator require complete specification of the distribution. Statistical inference is usually performed under the assumption that the distribution is correctly specified.
A less parametric analysis assumes that some aspects of the distribution of the dependent variable are correctly specified, whereas others are not specified or, if they are specified, are potentially misspecified. For count data models, considerable emphasis has been placed on analysis based on the assumption of correct specification of the conditional mean, or of correct specification of both the conditional mean and the conditional variance. This is a nonlinear generalization of the linear regression model, in which consistency requires correct specification of the mean and efficient estimation requires correct specification of both the mean and the variance. It is a special case of the class of generalized linear models that is widely used in the statistics literature. Estimators for generalized linear models coincide with maximum likelihood estimators if the specified density is in the linear exponential family. But even then the asymptotic distribution of the same estimator can differ across the two approaches if different second-moment assumptions are made.
The truly world-wide reach of the Web has brought with it a new realisation of the enormous importance of usability and user interface design. In the last ten years, much has become understood about what works in search interfaces from a usability perspective, and what does not. Researchers and practitioners have developed a wide range of innovative interface ideas, but only the most broadly acceptable make their way into major web search engines. This book summarizes these developments, presenting the state of the art of search interface design, both in academic research and in deployment in commercial systems. Many books describe the algorithms behind search engines and information retrieval systems, but the unique focus of this book is specifically on the user interface. It will be welcomed by industry professionals who design systems that use search interfaces as well as graduate students and academic researchers who investigate information systems.
The world is awash with digital data from social networks, blogs, business, science and engineering. Data-intensive computing facilitates understanding of complex problems that must process massive amounts of data. Through the development of new classes of software, algorithms and hardware, data-intensive applications can provide timely and meaningful analytical results in response to exponentially growing data complexity and associated analysis requirements. This emerging area brings many challenges that are different from traditional high-performance computing. This reference for computing professionals and researchers describes the dimensions of the field, the key challenges, the state of the art and the characteristics of likely approaches that future data-intensive problems will require. Chapters cover general principles and methods for designing such systems and for managing and analyzing the big data sets of today that live in the cloud and describe example applications in bioinformatics and cybersecurity that illustrate these principles in practice.
With sensors becoming ubiquitous, there is an increasing interest in mining the data from these sensors as the data are being collected. This analysis of streaming data, or data streams, presents new challenges to analysis algorithms. The size of the data can be massive, especially when the sensors number in the thousands and the data are sampled at a high frequency. The data can be non-stationary, with statistics that vary over time. Real-time analysis is often required, either to avoid untoward incidents or to better understand an interesting phenomenon. These factors make the analysis of streaming data, whether from sensors or other sources, very data- and compute-intensive. One possible approach to making this analysis tractable is to identify the important data streams and focus on them. This chapter describes the different ways in which this can be done, given that what makes a stream important varies from problem to problem and can often change with time within a single problem. It then illustrates these techniques by applying them to data from a real problem and discusses the challenges faced in this emerging field of streaming data analysis.
This chapter is organized as follows: first, I define what is meant by streaming data and use examples from practical problems to discuss the challenges in the analysis of these data. Next, I describe the two main approaches used to handle the streaming nature of the data – the sliding window approach and the forgetting factor approach.
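A minimal sketch of both approaches applied to a running mean; the choice of statistic and all names are illustrative, not taken from the chapter.

```python
# Two standard ways to track a statistic over a stream: a fixed-length
# sliding window and an exponential forgetting factor. Illustrative sketch.
from collections import deque

class SlidingWindowMean:
    def __init__(self, size):
        self.buf = deque(maxlen=size)      # old values drop out automatically
    def update(self, x):
        self.buf.append(x)
        return sum(self.buf) / len(self.buf)

class ForgettingFactorMean:
    def __init__(self, lam=0.95):          # lam close to 1 => long memory
        self.lam, self.mean = lam, None
    def update(self, x):
        self.mean = x if self.mean is None else self.lam * self.mean + (1 - self.lam) * x
        return self.mean
```

The window gives equal weight to the most recent observations and none to older ones, while the forgetting factor downweights the past geometrically, which adapts smoothly to non-stationarity at the cost of never fully discarding old data.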
Protecting communications networks against attacks where the aim is to steal information, disrupt order, or harm critical infrastructure can require the collection and analysis of staggering amounts of data. The ability to detect and respond to threats quickly is a paramount concern across sectors, and especially for critical government, utility, and financial networks. Yet detecting emerging or incipient threats in immense volumes of network traffic requires new computational and analytic approaches. Network security increasingly requires cooperation between human analysts able to spot suspicious events through means such as data visualization and automated systems that process streaming network data in near real-time to triage events so that human analysts are best able to focus their work.
This chapter presents a pair of network traffic analysis tools coupled to a computational architecture that enables the high-throughput, real-time visual analysis of network activity. The streaming data pipeline to which these tools are connected is designed to be easily extensible, allowing new tools to subscribe to data and add their own in-stream analytics. The visual analysis tools themselves – Correlation Layers for Information Query and Exploration (CLIQUE) and Traffic Circle – provide complementary views of network activity designed to support the timely discovery of potential threats in volumes of network data that exceed what is traditionally visualized. CLIQUE uses a behavioral modeling approach that learns the expected activity of actors (such as IP addresses or users) and collections of actors on a network, and compares current activity to this learned model to detect behavior-based anomalies.
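As a generic illustration of this behavioral-modeling idea (not the actual CLIQUE algorithm, whose details are not given here), the sketch below maintains per-actor running statistics and flags activity that deviates strongly from an actor's learned baseline; the class name and threshold are illustrative.

```python
# Generic behavior-based anomaly detection sketch: learn a per-actor
# baseline of activity and flag large deviations. This illustrates the
# idea only; it is not the CLIQUE algorithm.
class ActorBaseline:
    def __init__(self, threshold=3.0):
        self.stats = {}                    # actor -> (count, mean, M2), Welford updates
        self.threshold = threshold
    def update_and_score(self, actor, value):
        n, mean, m2 = self.stats.get(actor, (0, 0.0, 0.0))
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)
        self.stats[actor] = (n, mean, m2)
        std = (m2 / n) ** 0.5 if n > 1 else 0.0
        z = abs(value - mean) / std if std > 0 else 0.0
        return z > self.threshold          # True flags a behavioral anomaly
```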
Support vector machines (SVM) are currently one of the most popular and accurate methods for binary data classification and prediction. They have been applied to a variety of data and situations such as cyber-security, bioinformatics, web searches, medical risk assessment, financial analysis, and other areas [1]. This type of machine learning has been shown to be accurate and able to generalize predictions based upon previously learned patterns. However, current implementations are limited in that they can be trained accurately only on examples numbering in the tens of thousands, and they usually run only on serial computers. There are exceptions. A prime example is the annual machine learning and classification competitions, such as those at the International Conference on Artificial Neural Networks (ICANN), which present problems with more than 100,000 elements to be classified. However, to treat such large test cases the formalism of support vector machines must be modified.
SVMs were first developed by Vapnik and collaborators [2] as an extension to neural networks. Assume that we can convert the data values associated with an entity into numerical values that form a vector in the mathematical sense. These vectors form a space. Also, assume that this space of vectors can be separated by a hyperplane into the vectors that belong to one class and those that form the opposing class.
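A minimal sketch of this separating-hyperplane idea, using scikit-learn's linear-kernel SVC on synthetic two-dimensional data; the data and names are illustrative.

```python
# Sketch of the separating-hyperplane idea: fit a linear-kernel SVM on two
# synthetic clusters and inspect the learned hyperplane w.x + b = 0.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)),    # class -1 cluster of vectors
               rng.normal(+2, 1, (50, 2))])   # class +1 cluster of vectors
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear").fit(X, y)
print(clf.coef_, clf.intercept_)      # w and b of the hyperplane w.x + b = 0
print(clf.predict([[3.0, 3.0]]))      # classify a new vector by its side of the plane
```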