This chapter focuses on guidance for selecting the appropriate statistical test depending on the type of data being analyzed. Remember, data can be continuous, binary, ordinal, nominal, normally distributed, non-normally distributed, log-distributed, and so on (Chapter 2). Decisions must be based on a full understanding of the kind of data you have and your analytic objective. Conduct your preliminary analyses! Plot your data! Look at your data! Do you have outliers, skewness, errors?
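As a minimal sketch of such a preliminary look at the data (the simulated sample, the 1.5 × IQR outlier rule, and the use of NumPy/SciPy/Matplotlib are our own illustrative choices, not the chapter's):

```python
# Preliminary-analysis sketch: inspect a numeric sample for skewness
# and outliers before choosing a test. Data are simulated for illustration.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=1)
x = rng.lognormal(mean=0.0, sigma=0.8, size=200)  # deliberately skewed sample

print("skewness:", stats.skew(x))                 # > 0 suggests right skew
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print("potential outliers (1.5*IQR rule):", outliers.size)

plt.hist(x, bins=20)                              # plot your data!
plt.show()
```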
A contingency table shows the distribution of one variable within categories of another, such as gender vs. disease/no disease. These tables can be 2 × 2, 2 × 3, 2 × 4 (if you were to examine gender by race, for example), 2 × 6, etc. The second variable can have two values (such as yes or no) or three or more values, like race (White, African American, Asian, etc.). When examining a 2 × 2 table such as disease by gender, one would test for statistical significance using chi-square (χ²) analysis. However, as with ANOVA (Section 4.3.2), when including a variable that has more than two categories, like race, you can still run the χ² statistic, but there are so many resulting cells that you won't really know where the statistical differences lie, since you are examining so many categories at once.
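As a hedged illustration of the χ² test on a 2 × 2 table (the counts and the use of SciPy are our own, not from the chapter):

```python
# Chi-square test of independence on a 2 x 2 contingency table with SciPy;
# the counts below are invented for illustration.
from scipy.stats import chi2_contingency

#                 disease  no disease
table = [[20, 80],   # e.g. group 1
         [35, 65]]   # e.g. group 2

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```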
This chapter serves as a navigation tool that informs the investigator which statistical test to perform when the data are continuous, non-continuous, log-distributed, or time-to-event, and it explains why these are the proper tests. The reader may need to consult Chapter 1 again to look up data types and why it is important to know how their data are distributed. Harnessing this knowledge becomes particularly important as readers grow more proficient in their statistical skills, so that they can comfortably assess whether statistics are proper or improper when evaluating a manuscript for publication, or simply to fully understand peer-reviewed methods. The chapter also provides numerous online tools that you may use free of charge to conduct your own statistical tests.
This chapter serves as the foundation for understanding the underlying concepts of statistics. It should be read over and over again, as it is key to understanding the remainder of the book. Such important concepts as sample vs. population, data management, the Central Limit Theorem, and when to use parametric vs. non-parametric procedures are introduced. It provides concise descriptions and examples of basic measures of central tendency (mean, median, mode, range, interquartile range, etc.), measures of dispersion around these averages, continuous vs. non-continuous measures, normal vs. non-normal distributions, log distributions, confidence intervals, and when to use one-sided vs. two-sided P-values. Many example calculations of sample size and power are also described for a number of different test situations.
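A minimal sketch of these descriptive measures in Python's standard library (the sample values are invented for illustration):

```python
# Basic measures of central tendency and dispersion with the standard
# library; the sample is invented for illustration.
import statistics as st

x = [2, 3, 3, 4, 5, 5, 5, 7, 9, 12]

print("mean:  ", st.mean(x))
print("median:", st.median(x))
print("mode:  ", st.mode(x))
print("range: ", max(x) - min(x))
print("stdev: ", st.stdev(x))     # dispersion around the mean
q = st.quantiles(x, n=4)          # the three quartiles
print("IQR:   ", q[2] - q[0])
```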
This chapter gives examples of very basic graphs and charts and how to read them. Such graphs and charts are used by clinicians and by laboratory personnel. One plot that may be relatively overlooked is the normal probability plot, which gives a visual snapshot of the distribution of data. It offers a simple, non-statistical way to judge whether data are normally or non-normally distributed, bi- or tri-modal, or log-distributed.
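A hedged sketch of a normal probability plot with SciPy and Matplotlib (the simulated sample is our own choice):

```python
# Normal probability plot: points bend away from the reference line
# when data are not normal. The lognormal sample is illustrative.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=2)
x = rng.lognormal(size=100)               # deliberately non-normal data

stats.probplot(x, dist="norm", plot=plt)  # draws the quantile-quantile plot
plt.show()
```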
This chapter briefly reviews basic measures of disease occurrence and methods used to stratify disease occurrence by any number of factors, such as age. The distinction between disease rate and disease density is described. Investigators may refer back to Chapter 8 to understand how varying prevalence, sensitivity and specificity, and confidence intervals impact the tools used to measure disease burden.
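A minimal sketch of the two measures being distinguished, using invented numbers and a common textbook reading (a rate uses the population at risk; a density uses person-time of follow-up); the chapter's exact definitions may differ:

```python
# Cumulative incidence (rate) vs. incidence density; numbers invented.
new_cases = 12
population_at_risk = 4_000      # people followed over the period
person_years = 3_650.0          # total follow-up time actually observed

rate = new_cases / population_at_risk   # cases per person over the period
density = new_cases / person_years      # cases per person-year

print(f"cumulative incidence: {rate:.4f} per person")
print(f"incidence density:    {density:.5f} per person-year")
```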
Suitable luminophores are crucial for measuring flow-induced microrotations via the depolarization of phosphorescence anisotropy. The present work examines dyes of the xanthene family, namely Rhodamine B, Eosin Y and Erythrosine B. The dyes are examined, both in solution and incorporated in particles, with regard to their luminescence lifetimes and quantum yields. In an oxygen-rich environment at room temperature, all dyes exhibit lifetimes in the sub-microsecond range and a low-intensity signal, making them suitable for sensing fast rotations with sensitive acquisition systems.
We analyse the behaviour of the Euclidean algorithm applied to pairs (g,f) of univariate nonconstant polynomials over a finite field $\mathbb{F}_{q}$ of q elements when the higher-degree polynomial g is fixed. Considering all the elements f of fixed degree, we establish asymptotically optimal bounds in terms of q for the number of elements f that are relatively prime with g and for the average degree of $\gcd(g,f)$. We also exhibit asymptotically optimal bounds for the average-case complexity of the Euclidean algorithm applied to pairs (g,f) as above.
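For readers who want to experiment, here is a self-contained sketch of the polynomial Euclidean algorithm over a prime field (the coefficient-list representation, the choice of F_7, and the sample polynomials are ours; the paper's analysis covers general F_q):

```python
# Euclidean algorithm for univariate polynomials over F_p (p prime).
# Polynomials are coefficient lists [a_0, a_1, ...]; illustrative only.
P = 7  # the field F_7

def trim(f):
    while f and f[-1] % P == 0:
        f.pop()
    return f

def poly_mod(f, g):
    """Remainder of f divided by g over F_p."""
    f = trim(list(f)); g = trim(list(g))
    inv_lead = pow(g[-1], P - 2, P)            # inverse of lead coeff (Fermat)
    while len(f) >= len(g):
        coef = f[-1] * inv_lead % P
        shift = len(f) - len(g)
        for i, c in enumerate(g):
            f[i + shift] = (f[i + shift] - coef * c) % P
        trim(f)
    return f

def poly_gcd(f, g):
    while g:
        f, g = g, poly_mod(f, g)
    return f                                    # gcd up to a unit factor

# gcd(x^2 - 1, x - 1) over F_7 is x - 1 (up to a unit):
print(poly_gcd([P - 1, 0, 1], [P - 1, 1]))      # -> [6, 1], i.e. x + 6
```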
We introduce and study analogues of expander and hyperfinite graph sequences in the context of directed acyclic graphs, which we call ‘extender’ and ‘hypershallow’ graph sequences, respectively. Our main result is a probabilistic construction of non-hypershallow graph sequences.
Distinguishing between continuous and first-order phase transitions is a major challenge in random discrete systems. We study the topic for events with recursive structure on Galton–Watson trees. For example, let $\mathcal{T}_1$ be the event that a Galton–Watson tree is infinite and let $\mathcal{T}_2$ be the event that it contains an infinite binary tree starting from its root. These events satisfy similar recursive properties: $\mathcal{T}_1$ holds if and only if $\mathcal{T}_1$ holds for at least one of the trees initiated by children of the root, and $\mathcal{T}_2$ holds if and only if $\mathcal{T}_2$ holds for at least two of these trees. The probability of $\mathcal{T}_1$ has a continuous phase transition, increasing from 0 when the mean of the child distribution increases above 1. On the other hand, the probability of $\mathcal{T}_2$ has a first-order phase transition, jumping discontinuously to a non-zero value at criticality. Given the recursive property satisfied by the event, we describe the critical child distributions where a continuous phase transition takes place. In many cases, we also characterise the event undergoing the phase transition.
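As a brief numerical illustration (the Poisson offspring law and the fixed-point equations below are standard textbook facts, not taken from the paper): iterating each recursion from 1 converges to the probability of the event and makes the contrast between the two transitions visible.

```python
# Fixed-point iteration for Galton-Watson events under Poisson(lam)
# offspring; illustrative assumption, not the paper's general setting.
import math

def fixed_point(step, x=1.0, iters=10_000):
    for _ in range(iters):
        x = step(x)
    return x            # largest fixed point, i.e. P(event)

for lam in (0.8, 2.0, 3.0, 3.5):
    # T1 (survival): the number of surviving children is Poisson(lam*s).
    s = fixed_point(lambda s: 1 - math.exp(-lam * s))
    # T2 (infinite binary subtree): need at least two successful children.
    p = fixed_point(lambda p: 1 - math.exp(-lam * p) * (1 + lam * p))
    print(f"lam={lam}: P(T1) ~ {s:.4f}, P(T2) ~ {p:.4f}")
```

In this Poisson example the printed values show P(T1) rising continuously once λ exceeds 1, while P(T2) remains 0 at λ = 3 and appears above 0.5 at λ = 3.5, a numerical trace of the discontinuous jump.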
Between 19 May and 12 June 2020, employees of the UZ Brussel were recruited into this study, which aimed to document the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) seroprevalence, to investigate potential work-related risk factors for SARS-CoV-2 infection and to estimate the proportion of asymptomatic infections. In total, 2662 participants were included, of whom 7.4% had immunoglobulin G antibodies against SARS-CoV-2. Of the participants reporting a positive polymerase chain reaction test for SARS-CoV-2, 89% had antibodies at the time of blood sampling. Eleven per cent of the antibody-positive participants reported no recent symptoms suggestive of coronavirus disease 2019 (COVID-19). Reported fever, chest pain and/or anosmia/ageusia were significantly associated with the presence of antibodies against SARS-CoV-2. The presence of antibodies was highest in the group that had had contact with COVID-19-infected individuals outside the hospital, with or without using appropriate personal protective equipment (PPE) (P < 0.001). Inside the hospital, a statistically significant difference was observed between the employees considered at low-risk exposure and the intermediate-risk exposure group (P = 0.005), as well as between the high-risk exposure group and the intermediate-risk exposure group (P < 0.001). These findings highlight the importance of using correct PPE.
Like a hydra, fraudsters adapt and circumvent increasingly sophisticated barriers erected by public or private institutions. Among these institutions, banks must quickly take measures to avoid losses while guaranteeing the satisfaction of law-abiding customers. Facing an expanding flow of operations, effective banking relies on data analytics to support established risk control processes, but also on a better understanding of the underlying fraud mechanisms. In addition, fraud being a criminal offence, the evidential aspect of the process must also be considered. These legal, operational, and strategic constraints lead to compromises on the means to be implemented for fraud management. This paper first focuses on translating the practical questions raised in the banking industry at each step of the fraud management process into the performance evaluation required to design a fraud detection model. Secondly, it considers a range of machine learning approaches that address these specificities: the imbalance between fraudulent and non-fraudulent operations, the lack of fully trusted labels, the concept-drift phenomenon, and the unavoidable trade-off between the accuracy and the interpretability of detection. This state-of-the-art review sheds some light on a technology race between black-box machine learning models improved by post-hoc interpretation and intrinsically interpretable models boosted to gain accuracy. Finally, it discusses how concrete and promising hybrid approaches can provide pragmatic, short-term answers to banks and policy makers without losing sight of the economic and ethical stakes this technological race holds for stakeholders.
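A minimal, hedged sketch of one specificity named above, class imbalance, using scikit-learn's class weighting (the synthetic data and the particular estimator are our own choices, not techniques from the paper):

```python
# Class-weighted logistic regression on an imbalanced synthetic dataset;
# purely illustrative of the imbalance problem, not the paper's methods.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           random_state=0)     # ~1% "fraud" class
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 'balanced' reweights errors on the rare class by inverse class frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```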
Catastrophe insurance markets fail to provide sufficient protection against natural catastrophes even though they have the capacity to absorb the losses. In this paper, we assume the catastrophic risks are dependent and extremely heavy-tailed, and that insurers have limited liability to cover losses up to a certain amount. We provide a comprehensive study showing that diversification in catastrophe insurance markets can shift from suboptimal to preferred as the number of insurers in the market increases. This highlights the importance of coordination among insurers and of government intervention in encouraging insurers to participate in the catastrophe insurance market to exploit risk sharing. Simulation studies are provided to illuminate our key findings.
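An illustrative Monte Carlo sketch of the setting (the Pareto tail, the liability cap, independence, and equal sharing are all our assumptions, not the paper's model): compare the tail of one insurer's share as more insurers pool capped heavy-tailed risks.

```python
# Illustrative only: capped Pareto losses shared equally among n insurers;
# the distribution, cap, and sharing rule are assumptions for this sketch.
import numpy as np

rng = np.random.default_rng(0)
alpha, cap, n_sims = 0.9, 50.0, 200_000   # alpha < 1: extremely heavy tail

def capped_losses(n_risks):
    x = rng.pareto(alpha, size=(n_sims, n_risks)) + 1.0   # Pareto >= 1
    return np.minimum(x, cap)                             # limited liability

for n in (1, 2, 10, 50):
    share = capped_losses(n).sum(axis=1) / n   # one insurer's equal share
    var99 = np.quantile(share, 0.99)           # 99% Value-at-Risk of share
    print(f"n={n:>2}: 99% VaR of one insurer's share = {var99:.2f}")
```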
This paper shows that if the errors in a multiple regression model are heavy-tailed, the ordinary least squares (OLS) estimators for the regression coefficients are tail-dependent. The tail dependence arises because the OLS estimators are stochastic linear combinations of heavy-tailed random variables. Tail dependence also exists between the fitted sum of squares (FSS) and the residual sum of squares (RSS), because they are stochastic quadratic combinations of heavy-tailed random variables.
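A short sketch of the mechanism (the t-distributed errors and fixed design are illustrative assumptions): the OLS estimator equals the true coefficient plus a linear combination of the errors, so heavy tails in the errors propagate into the estimator.

```python
# Why OLS coefficients inherit heavy tails: beta_hat = beta + (X'X)^{-1} X' e
# is a linear combination of the errors e. Simulation choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, reps, beta = 100, 5_000, np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed design
A = np.linalg.inv(X.T @ X) @ X.T      # maps errors to coefficient noise

slopes = np.empty(reps)
for r in range(reps):
    e = rng.standard_t(df=1.5, size=n)   # heavy-tailed errors (infinite variance)
    slopes[r] = (beta + A @ e)[1]        # fitted slope for this sample

print("most extreme fitted slopes:", np.sort(slopes)[[0, -1]])
```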
We consider a competition involving $r$ teams, where each individual game involves two teams, and where each game between teams $i$ and $j$ is won by $i$ with probability $P_{i,j} = 1 - P_{j,i}$. We suppose that $i$ and $j$ are scheduled to play $n(i,j)$ games and say that the team that wins the most games is the winner of the competition. We show that the conditional probability that $i$ is the winner, given that $i$ wins $k$ games, is increasing in $k$. We bound the tail probability of the number of wins of the winning team. We consider the special case where $P_{i,j} = {v_i}/{(v_i + v_j)}$, and obtain structural results on the probability that team $i$ is the winner. We give efficient simulation approaches for computing the probability that team $i$ is the winner, and the conditional probability given the number of wins of $i$.
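A naive Monte Carlo sketch of the special case $P_{i,j} = v_i/(v_i+v_j)$ (the strengths, the uniform schedule $n(i,j)=10$, and the tie-breaking rule are invented for illustration; the paper's efficient simulation approaches are more sophisticated):

```python
# Naive Monte Carlo estimate of P(team i wins the competition) in the
# special case P[i][j] = v[i]/(v[i]+v[j]); setup invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
v = np.array([1.0, 2.0, 3.0])   # team strengths
games_each = 10                  # n(i, j) = 10 for every pair
reps = 20_000

r = len(v)
win_count = np.zeros(r)
for _ in range(reps):
    wins = np.zeros(r)
    for i in range(r):
        for j in range(i + 1, r):
            p = v[i] / (v[i] + v[j])
            w = rng.binomial(games_each, p)   # games i wins against j
            wins[i] += w
            wins[j] += games_each - w
    win_count[np.argmax(wins)] += 1           # ties go to the lowest index

print("estimated P(team is winner):", win_count / reps)
```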