• How do Poisson models differ from traditional linear regression models?
• What are the distributional assumptions of the Poisson regression model? For any count model?
• What is the dispersion statistic? How is it calculated?
• What is the relationship of Poisson standard errors to the dispersion statistic?
• What is apparent overdispersion? How do we deal with it?
• How can a synthetic Monte Carlo Poisson model be developed?
• How are Poisson coefficients and rate-parameterized coefficients interpreted?
• What are marginal effects, partial effects, and discrete change with respect to count models?
Poisson regression is fundamental to the modeling of count data. It was the first model specifically used to model counts, and it still stands at the base of the many types of count models available to analysts. However, as emphasized in the last chapter, because of the Poisson distributional assumption of equidispersion, using a Poisson model on real study data is usually unsatisfactory. It is sometimes possible to make adjustments to the Poisson model that remedy the problem of under- or overdispersion, but unfortunately this is often not possible. In this chapter, which is central to the book, we look at the nature of Poisson regression and provide guidelines on how to construct, interpret, and evaluate Poisson models with respect to their fit. The majority of the fit tests we use for a Poisson model are also applicable to the more advanced count models discussed later.
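To make the equidispersion assumption and the dispersion statistic concrete, the following is a minimal, illustrative sketch of a synthetic Monte Carlo Poisson model in Python. The use of numpy and statsmodels is my own choice, not the book's software; the coefficients are arbitrary. Counts are generated from a known linear predictor, a Poisson GLM is fit, and the Pearson dispersion statistic is computed. Values near 1.0 indicate equidispersion.

```python
# Illustrative sketch: simulate equidispersed Poisson data, fit a Poisson GLM,
# and compute the Pearson dispersion statistic (Pearson chi2 / residual df).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000

# Synthetic predictors and "true" coefficients (arbitrary illustrative values)
x1 = rng.normal(size=n)
x2 = rng.binomial(1, 0.5, size=n)
xb = 1.0 + 0.75 * x1 - 1.25 * x2        # linear predictor
y = rng.poisson(np.exp(xb))             # counts generated with mean exp(xb)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.summary())

# Dispersion statistic: values well above 1.0 suggest overdispersion
dispersion = fit.pearson_chi2 / fit.df_resid
print("Pearson dispersion:", dispersion)
```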
In this final chapter, I briefly discuss models that have been developed for data that are generally more complex than what we have thus far observed. The sections are meant to introduce you to these methods if you have not already read about them – or used them.
The following list shows the types of data situations with which you may be confronted, together with a type of model that can be used in such a situation (given in parentheses).
Types of Data and Problems Dealt with in This Chapter
• Data sets are small and count data highly unbalanced. (exact Poisson)
• Counts have been truncated or censored at the left, right (highest values), or middle areas of the distribution. (truncated and censored Poisson and NB models)
• The count response appears to have two or more components, each being generated by a different mechanism. (finite mixture model)
• One or more model predictors are ill-shaped, needing smoothing. (GAM smoothers)
• Counts are erratically distributed and do not appear to follow a parametric count distribution very well. (quantile count models)
• Data are longitudinal or clustered in nature, where observations are not independent. (panel models, e.g., GEE, GLMM; see the brief sketch following this list)
• Data are nested in levels or are in a hierarchical structure. (multilevel models)
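As one brief illustration of the panel models mentioned above, the sketch below fits a Poisson GEE with an exchangeable working correlation using Python's statsmodels. The data frame `df` and its columns `y`, `x1`, `x2`, and `id` are hypothetical placeholders of my own, not data from the book.

```python
# Illustrative sketch: Poisson GEE for clustered counts, assuming a pandas
# DataFrame `df` with count response `y`, predictors `x1` and `x2`, and a
# cluster identifier `id`.
import statsmodels.api as sm
import statsmodels.formula.api as smf

gee_fit = smf.gee("y ~ x1 + x2", groups="id", data=df,
                  family=sm.families.Poisson(),
                  cov_struct=sm.cov_struct.Exchangeable()).fit()
print(gee_fit.summary())
```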
A very brief overview of the Bayesian modeling of count data will be presented in Section 9.8.
• Why hasn't the PIG model been widely used before this?
• What types of data are best modeled using a PIG regression?
• How do we model data with a very high initial peak and long right skew?
• How do we know whether a PIG is a better-fitted model than negative binomial or Poisson models?
Poisson Inverse Gaussian Model Assumptions
The Poisson inverse Gaussian (PIG) model is similar to the negative binomial model in that both are mixture models. The negative binomial model is a mixture of Poisson and gamma distributions, whereas the PIG model is a mixture of Poisson and inverse Gaussian distributions.
Those of you who are familiar with generalized linear models will notice that there are three continuous GLM distributions: normal, gamma, and inverse Gaussian. The normal distribution is typically parameterized as a lognormal distribution when associated with count models, presumably because the log link forces the distribution to have only nonnegative values. The Poisson and negative binomial (both NB2 and NB1) models have log links. Recall that the negative binomial is a mixture of the Poisson and gamma distributions, with variance functions μ and μ²/ν, respectively. We inverted ν so that there is a direct relationship between the mean, the dispersion, and the variance function. Likewise, the PIG model is a mixture of the Poisson and inverse Gaussian distributions, the latter having variance function μ³φ.
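For reference, and writing α for the dispersion parameter (a notational choice made here for the sake of the sketch), these mixtures lead to the familiar count-model variance functions:

```latex
\begin{align*}
\text{Poisson:} \qquad & \operatorname{Var}(y) = \mu \\
\text{NB2 (Poisson--gamma):} \qquad & \operatorname{Var}(y) = \mu + \alpha\mu^{2} \\
\text{PIG (Poisson--inverse Gaussian):} \qquad & \operatorname{Var}(y) = \mu + \alpha\mu^{3}
\end{align*}
```

As α approaches 0, both the NB2 and PIG variances reduce to the Poisson variance μ.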
Modeling Count Data is written for the practicing researcher who has a reason to analyze and draw sound conclusions from modeling count data. More specifically, it is written for an analyst who needs to construct a count response model but is not sure how to proceed.
A count response model is a statistical model for which the dependent, or response, variable is a count. A count is understood as a nonnegative discrete integer ranging from zero to some specified greater number. This book aims to be a clear and understandable guide to the following points:
• How to recognize the characteristics of count data
• Understanding the assumptions on which a count model is based
• Determining whether data violate these assumptions (e.g., overdispersion), why this is so, and what can be done about it
• Selecting the most appropriate model for the data to be analyzed
• Constructing a well-fitted model
• Interpreting model parameters and associated statistics
• Predicting counts, rate ratios, and probabilities based on a model
• Evaluating the goodness-of-fit for each model discussed
There is indeed a lot to consider when selecting the best-fitted model for your data. I will do my best in these pages to clarify the foremost concepts and problems unique to modeling counts. If you follow along carefully, you should have a good overview of the subject and a basic working knowledge needed for constructing an appropriate model for your study data.
• What are some of the foremost tests to determine whether a Poisson model is overdispersed?
• What is scaling? What does it do to a count model?
• Why should robust standard errors be used as a default?
• What is a quasi-likelihood model?
This chapter can be considered a continuation of Chapter 2. Few real-life Poisson data sets are truly equidispersed. Overdispersion to some degree is inherent in the vast majority of Poisson data. Thus, the real question deals with the amount of overdispersion in a particular model – is it statistically sufficient to require a model other than Poisson? This is one of the foremost questions we address in this chapter, together with how we assess fit and then adjust for the lack of it.
Basics of Count Model Fit Statistics
Most statisticians consider overdispersion the key problem when assessing count model fit. That is, when thinking of the fit of a count model, an analyst typically attempts to evaluate whether the model is extradispersed – which usually means overdispersed. If there is evidence of overdispersion in a Poisson model, the problem then is to determine what gives rise to it. If we can determine the cause, we can select the model appropriate to the data.
Analysts have used a variety of tests to determine whether the model they used on their data actually fits.
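As a concrete illustration of two such checks, the sketch below uses Python's statsmodels (my own illustrative code, not the book's) and assumes a count response `y` and a design matrix `X` with a constant term are already in memory. It computes the Pearson dispersion statistic and compares model-based with robust (sandwich) standard errors.

```python
# Illustrative sketch: two common checks for overdispersion in a Poisson model,
# assuming `y` (counts) and `X` (design matrix with constant) are defined.
import statsmodels.api as sm

poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# 1. Pearson dispersion statistic: values substantially greater than 1.0
#    indicate overdispersion (substantially less than 1.0, underdispersion).
print("dispersion:", poisson_fit.pearson_chi2 / poisson_fit.df_resid)

# 2. Robust (sandwich) standard errors: if they differ markedly from the
#    model-based standard errors, the Poisson variance assumption is suspect.
robust_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC0")
print(poisson_fit.bse)   # model-based SEs
print(robust_fit.bse)    # robust SEs
```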
• What is the relationship between a probability distribution function (PDF) and a statistical model?
• What are the parameters of a statistical model? Where do they come from, and can we ever truly know them?
• How does a count model differ from other regression models?
• What are the basic count models, and how do they relate to one another?
• What is overdispersion, and why is it considered to be the fundamental problem when modeling count data?
What Are Counts?
When discussing the modeling of count data, it's important to clarify exactly what is meant by a count, as well as “count data” and “count variable.” The word “count” is typically used as a verb meaning to enumerate units, items, or events. We might count the number of road kills observed on a stretch of highway, how many patients died at a particular hospital within 48 hours of having a myocardial infarction, or how many separate sunspots were observed in March 2013. “Count data,” on the other hand, is a plural noun referring to observations made about events or items that are enumerated. In statistics, count data refer to observations that have only nonnegative integer values ranging from zero to some greater undetermined value. Theoretically, counts can range from zero to infinity, but they are always limited to some lesser distinct value – generally the maximum value of the count data being modeled.
Massive amounts of data on human beings can now be analyzed. Pragmatic purposes abound, including selling goods and services, winning political campaigns, and identifying possible terrorists. Yet 'big data' can also be harnessed to serve the public good: scientists can use big data to do research that improves the lives of human beings, improves government services, and reduces taxpayer costs. In order to achieve this goal, researchers must have access to this data, raising important privacy questions. What are the ethical and legal requirements? What are the rules of engagement? What are the best ways to provide access while also protecting confidentiality? Are there reasonable mechanisms to compensate citizens for privacy loss? The goal of this book is to answer some of these questions. The book's authors paint an intellectual landscape that includes legal, economic, and statistical frameworks. The authors also identify new practical approaches that simultaneously maximize the utility of data access while minimizing information risk.
The essays in this part of the book make powerful arguments for the value of data in the public sector. We are all aware of their value to the private sector; indeed, the market advantage of many large companies in the United States, such as Google, Facebook, and Yahoo, lies in their access to large datasets on individual behavior, and their ability to turn data into privately held information. Yet the experience of the authors demonstrates that the gap between vision and reality in the public sector is large, for many reasons. The authors identify new approaches that can enable public sector custodians to combine and use data, and new approaches to enable researcher access so that data can be turned into publicly held information. A major leitmotif in each chapter is, of course, trust.
What is the vision? An illustrative, but not exhaustive, list identified by the authors of the potential and actual value of big data ranges from simply better, more targeted city management to reduced taxpayer cost and burden, from greater transparency and less corruption to greater economic growth, and indeed to addressing problems of epidemics, climate change, and pollution. Interestingly, as Elias points out, the European Commission recognized the value of data as far back as 1950, when the European Convention on Human Rights noted, “There shall be no interference by a public authority with the exercise of this right [to privacy] except such as is in accordance with the law and is necessary in a democratic society in the interests of national security, public safety or the economic well-being of the country, for the prevention of disorder or crime, for the protection of health or morals, or for the protection of the rights and freedoms of others” (emphasis added).
How do information privacy laws regulate the use of big data techniques, if at all? Do these laws strike an appropriate balance between allowing the benefits of big data and protecting individual privacy? If not, how might we amend or extend laws to better strike this balance?
This chapter attempts to answer questions like these. It builds on Chapter 1 of this volume, by Strandburg, which focused primarily on legal rules governing the collection of data. This chapter will focus primarily on the law of the United States, although it will make comparisons to the laws of other jurisdictions, especially the European Union, which is well covered in Chapter 8 of this volume.
Most information privacy law focuses on collection or disclosure, not use. Once data has been legitimately obtained, few laws dictate what may be done with the information. The exceptions to this general pattern receive attention below; laws that govern use tend to focus on particular types of users, especially users that lawmakers have deemed to owe obligations of confidentiality to data subjects. For example, laws regulating the health and financial industries, industries that have historically evolved obligations of confidentiality, constrain not only collection and disclosure but also use.
This chapter argues that our current information privacy laws are failing to protect individuals from harm. The discussion focuses primarily on shortcomings in the law that relate to specific features of big data, although it also describes a few shortcomings that relate only tangentially to these features. All of these shortcomings expose some individuals to the risk of harm in certain circumstances. We need to develop ways to amend the laws to recalibrate the balance between analytics and risk of harm. Ultimately, the chapter proposes five general approaches for change.
Propose. Break. Propose again. So pre-modern cryptography cycled. An encryption scheme was proposed; a cryptanalyst broke it; a modification, or even a completely new scheme, was proposed. Nothing ensured that the new scheme would in any sense be better than the old. Among the astonishing breakthroughs of modern cryptography is the methodology of rigorously defining the goal of a cryptographic primitive – what it means to break the primitive – and providing a clear delineation of the power – information, computational ability – of the adversary to be resisted (Goldwasser and Micali 1984; Goldwasser et al. 1988). Then, for any proposed method, one proves that no adversary of the specified class can break the primitive. If the class of adversaries captures all feasible adversaries, the scheme can be considered to achieve the stated goal.
This does not mean the scheme is invulnerable, as the goal may have been too weak to capture the full demands placed on the primitive. For example, when the cryptosystem needs to be secure against a passive eavesdropper, the requirements are weaker than when it needs to be secure against an active adversary that can determine whether or not arbitrary ciphertexts are well formed (such an attack was successfully launched against PKCS#1; Bleichenbacher 1998). In this case the goal may be reformulated to be strictly more stringent than the original goal, and a new system proposed (and proved). This strengthening of the goal converts the propose–break–propose again cycle into a path of progress.
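As a rough illustration of what such a rigorous definition looks like, the toy Python sketch below (all names hypothetical; it implements no real cryptosystem or standard API) plays a simplified indistinguishability game against a passive adversary. The adversary's advantage over a 1/2 success rate is exactly the quantity a security proof must bound; stronger, active-adversary definitions add oracles, for example one that reveals whether a ciphertext is well formed.

```python
# Illustrative sketch (hypothetical names, not any library's API): a simplified
# indistinguishability game. The adversary supplies two equal-length messages,
# receives the encryption of one chosen at random under a fresh key, and must
# guess which; a secure scheme keeps the success rate at essentially 1/2.
import secrets

def xor_encrypt(key: bytes, msg: bytes) -> bytes:
    # One-time-pad-style toy cipher, for demonstration only.
    return bytes(k ^ m for k, m in zip(key, msg))

def guess_first_byte(ciphertext: bytes) -> int:
    # A trivial adversary that guesses from the first ciphertext byte.
    return ciphertext[0] & 1

def ind_game(trials: int = 10_000) -> float:
    wins = 0
    for _ in range(trials):
        m0, m1 = b"attack at dawn!!", b"retreat at dusk!"   # equal length
        key = secrets.token_bytes(len(m0))                  # fresh key per trial
        b = secrets.randbelow(2)
        ciphertext = xor_encrypt(key, (m0, m1)[b])
        wins += int(guess_first_byte(ciphertext) == b)
    return wins / trials                                    # should hover near 0.5

if __name__ == "__main__":
    print(ind_game())
```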