For $\ell \geq 3$, an $\ell$-uniform hypergraph is disperse if the number of edges induced by any set of $\ell +1$ vertices is 0, 1, $\ell$, or $\ell +1$. We show that every disperse $\ell$-uniform hypergraph on $n$ vertices contains a clique or independent set of size $n^{\Omega _{\ell }(1)}$, answering a question of the first author and Tomon. To this end, we prove several structural properties of disperse hypergraphs.
We investigate the properties of the linear two-way fixed effects (FE) estimator for panel data when the underlying data generating process (DGP) does not have a linear parametric structure. The FE estimator is consistent for some pseudo-true value and we characterize the corresponding asymptotic distribution. We show that the rate of convergence is determined by the degree of model misspecification, and that the asymptotic distribution can be non-normal. We propose a novel autoregressive double adaptive wild (AdaWild) bootstrap procedure applicable for a large class of DGPs. Monte Carlo simulations show that it performs well for panels of small and moderate dimensions. We use data from U.S. manufacturing industries to illustrate the benefits of our procedure.
We describe the asymptotic behaviour of large degrees in random hyperbolic graphs for all values of the curvature parameter $\alpha$. We prove that, with high probability, the node degrees satisfy the following ordering property: the ranking of the nodes by decreasing degree coincides with the ranking of the nodes by increasing distance to the centre, at least up to any constant rank. In the sparse regime $\alpha>\tfrac{1}{2}$, the rank at which these two rankings cease to coincide is $n^{1/(1+8\alpha)+o(1)}$. We also provide a quantitative description of the large degrees by proving the convergence in distribution of the normalised degree process towards a Poisson point process. In particular, this establishes the convergence in distribution of the normalised maximum degree of the graph. A transition occurs at $\alpha = \tfrac{1}{2}$, which corresponds to the connectivity threshold of the model. For $\alpha < \tfrac{1}{2}$, the maximum degree is of order $n - O(n^{\alpha + 1/2})$, whereas for $\alpha \geq \tfrac{1}{2}$, the maximum degree is of order $n^{1/(2\alpha)}$. In the $\alpha < \tfrac{1}{2}$ and $\alpha > \tfrac{1}{2}$ cases, the limit distribution of the maximum degree belongs to the class of extreme value distributions (Weibull for $\alpha < \tfrac{1}{2}$ and Fréchet for $\alpha > \tfrac{1}{2}$). This refines previous estimates on the maximum degree for $\alpha > \tfrac{1}{2}$ and extends the study of large degrees to the dense regime $\alpha \leq \tfrac{1}{2}$.
This study investigates unintended information flow in large language models (LLMs) by proposing a computational linguistic framework for detecting and analyzing domain anchorage. Domain anchorage is a phenomenon potentially caused by in-context learning or latent “cache” retention of prior inputs, which enables language models to infer and reinforce shared latent concepts across interactions, leading to uniformity in responses that can persist across distinct users or prompts. Using GPT-4 as a case study, our framework systematically quantifies the lexical, syntactic, semantic, and positional similarities between inputs and outputs to detect these domain anchorage effects. We introduce a structured methodology to evaluate the associated risks and highlight the need for robust mitigation strategies. By leveraging domain-aware analysis, this work provides a scalable framework for monitoring information persistence in LLMs, which can inform enterprise guardrails to ensure response consistency, privacy, and safety in real-world deployments.
We contribute to the recent debate on the instability of the slope of the Phillips curve by offering insights from a flexible time-varying instrumental variable (IV) approach robust to weak instruments. Our robust approach focuses directly on the Phillips curve and allows general forms of instability, in contrast to current approaches based either on structural models with time-varying parameters or IV estimates in ad-hoc sub-samples. We find evidence of a weakening of the slope of the Phillips curve starting around 1980. We also offer novel insights on the Phillips curve during the recent pandemic: The flattening has reverted and the Phillips curve is back.
Social relationships provide opportunities to exchange and obtain health advice. Not only close confidants but also acquaintances met outside a closed circle of family and friends, e.g., in voluntary organizations, may be perceived as sources of health advice. This study is the first to analyze the structure of complete health advice networks in three voluntary organizations and compare them with more commonly studied close relationships. To this end, we collected data on multiple networks and health outcomes among 143 middle-aged and older adults (mean age = 53.9 years) in three carnival clubs in Germany. Our analyses demonstrate that perceived health advice and close relationships overlap by only 34%. Moreover, recent advances in exponential random graph models (ERGMs) allow us to illustrate that the network structure of perceived health advice differs starkly from that of close relationships. For instance, we found that advice networks exhibited lower transitivity and greater segregation by gender and age in comparison to networks of close relationships. We also found that actors with poor physical health perceive fewer individuals as health advisors than those with good physical health. Our findings suggest that community settings, such as voluntary associations, provide a unique platform for exchanging health advice and information among both close and distant network members.
Choose the type of multivariable model based on the type of outcome variable you have. Perform univariate statistics to understand the distribution of your independent and outcome variables. Perform bivariate analyses of your independent variables. Run a correlation matrix to understand how your independent variables are related to one another. Assess your missing data. Perform your analysis and assess how well your model fits the data. Assess the strength of your individual covariates in estimating the outcome. Use regression diagnostics to assess the underlying assumptions of your model. Perform sensitivity analyses to assess the robustness of your findings, and consider whether it would be possible to validate your model. Publish your work and soak up the glory.
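A minimal sketch of several of these steps in Python, assuming a hypothetical pandas DataFrame with a dichotomous outcome `died` and illustrative predictors `age`, `smoker`, and `bmi`; none of the names come from the text.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("study_data.csv")  # hypothetical data file

# Univariate statistics: distributions of predictors and outcome
print(df[["age", "smoker", "bmi", "died"]].describe())

# Correlation matrix among the independent variables
print(df[["age", "smoker", "bmi"]].corr())

# Assess missing data
print(df.isna().sum())

# Fit the model and assess the individual covariates
model = smf.logit("died ~ age + smoker + bmi", data=df).fit()
print(model.summary())  # coefficients, standard errors, p-values
```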
Performing the analysis: a series of designed steps to ensure you are entering the correct information into the model. A useful convention for coding dichotomous variables: assign “1” to the presence and “0” to the absence of the condition; the variable’s mean will then equal the condition’s prevalence. The choice of reference group will not affect the results, but it will affect how the results are reported; choose your reference category based on the main hypothesis. Interaction terms are entered by creating a product term: a variable whose value is the product of two independent variables. For proportional hazards or other survival-time analyses, enter the starting time, the date of the outcome of interest, and the censoring date.
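A sketch of these coding conventions with pandas; all column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "smoker": [1, 0, 1, 0],              # 1 = condition present, 0 = absent
    "age": [60, 45, 70, 52],
    "enroll_date": pd.to_datetime(["2020-01-01"] * 4),
    "event_date": pd.to_datetime(["2021-06-01", "2022-01-01",
                                  "2020-09-15", "2022-01-01"]),
    "event": [1, 0, 1, 0],               # 1 = outcome occurred, 0 = censored
})

# With 0/1 coding, the variable's mean equals the condition's prevalence
print(df["smoker"].mean())

# Interaction term: the product of two independent variables
df["smoker_x_age"] = df["smoker"] * df["age"]

# Follow-up time for survival analysis: event or censoring date minus start
df["followup_days"] = (df["event_date"] - df["enroll_date"]).dt.days
```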
Variable selection techniques: automatic procedures determining which independent variables will be included in a model. It is usually better for the investigator to decide what variables should be in the model rather than using a statistical algorithm.
Working with a biostatistician should be an iterative process. Especially with complicated studies, consult them at each analysis phase. For conducting the analysis, use the stat package your research group uses so you will always be able to get help when needed.
Propensity scores are a statistical method for adjusting for baseline differences between study groups. The scores are based on the probability of a subject being in a particular group, conditional on that subject’s values on those independent variables thought to influence group membership. Using propensity scores with multivariable analysis produces a better adjustment for baseline differences than simply including potential confounders in a multivariable model predicting outcome. Propensity scores are also particularly helpful when outcomes are rare and the proportions of subjects in the independent groups are relatively equal. Another advantage of propensity scores is that they make no assumptions about the relationship between the individual confounders and outcome. The adequacy of a propensity score is judged by whether there is sufficient overlap between the groups and whether it balances the covariates.
There are four major ways you can use propensity scores: matching, weighting, stratification, and as a covariate in a model predicting outcome.
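A minimal sketch of propensity-score estimation and one of the four uses (inverse-probability weighting), assuming a hypothetical DataFrame `df` with a treatment indicator `treated`, illustrative covariates, and an outcome column; all names are assumptions, not from the text.

```python
import numpy as np
import statsmodels.formula.api as smf

# Estimate each subject's probability of being in the treated group
ps_model = smf.logit("treated ~ age + sex + comorbidity_score", data=df).fit()
df["pscore"] = ps_model.predict(df)

# Adequacy check: is there sufficient overlap of scores between groups?
print(df.groupby("treated")["pscore"].describe())

# Inverse-probability weights (covariate balance should be checked after weighting)
df["ipw"] = np.where(df["treated"] == 1,
                     1 / df["pscore"], 1 / (1 - df["pscore"]))

# Weighted outcome model using the propensity-score weights
outcome = smf.wls("outcome ~ treated", data=df, weights=df["ipw"]).fit()
print(outcome.params["treated"])
```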
The choice of multivariable model depends primarily on the type of outcome variable. Use multiple linear regression and analysis of variance for interval outcomes; multiple logistic regression and log-binomial regression for dichotomous outcomes; proportional odds regression for ordinal outcomes; multinomial logistic regression for nominal outcomes; proportional hazards analysis for time to outcome; and Poisson regression and negative binomial regression for counts and for incidence rates. Each model has a different set of underlying assumptions. All of the models assume that there is only one observation of outcome for each subject.
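A sketch pairing outcome types with models using statsmodels; the outcome and predictor names are hypothetical, and each call assumes one observation per subject in a DataFrame `df`.

```python
import statsmodels.formula.api as smf

fit_lin = smf.ols("sbp ~ age + treatment", data=df).fit()        # interval outcome
fit_log = smf.logit("died ~ age + treatment", data=df).fit()     # dichotomous outcome
# (log-binomial: a GLM with a binomial family and log link; ordinal
#  outcomes: statsmodels' OrderedModel for proportional odds regression)
fit_nom = smf.mnlogit("diagnosis ~ age", data=df).fit()          # nominal outcome
fit_cnt = smf.poisson("admissions ~ age", data=df).fit()         # counts
fit_nb = smf.negativebinomial("admissions ~ age", data=df).fit() # overdispersed counts
fit_ph = smf.phreg("followup_days ~ age + treatment",            # time to outcome
                   data=df, status=df["event"]).fit()
```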
Having determined the type of multivariable analysis to perform based on the outcome variable, one must next determine how to incorporate independent variables into the model. The important considerations are the type of independent variables you have (dichotomous, nominal, interval, or ordinal) and the relationship between each independent variable and the outcome. Dichotomous independent variables can be used in any multivariable analysis. The other types of independent variables require special consideration. With interval variables, identifying non-linear associations is particularly important; when the association is nonlinear, the variable should be transformed, and the type of transformation will depend on the association with outcome. Splines are a sophisticated method of modeling complex relationships between an interval independent variable and the outcome.
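A sketch of the two approaches to a nonlinear interval predictor, assuming a hypothetical DataFrame `df`: a simple transformation, and a flexible spline basis via patsy’s `bs()` inside a statsmodels formula. Variable names are illustrative.

```python
import numpy as np
import statsmodels.formula.api as smf

# Simple transformation when the association with outcome looks logarithmic
fit_log = smf.logit("died ~ np.log(creatinine) + age", data=df).fit()

# Spline basis for a more complex relationship between age and outcome
fit_spline = smf.logit("died ~ bs(age, df=4) + smoker", data=df).fit()
print(fit_spline.summary())
```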
Standard statistical analyses assume that each observation (subject) is independent; in other words, the outcomes of different subjects are not correlated. This is not always the case. For example, in a longitudinal study, subjects may be assessed repeatedly. Subjects may also be enrolled in established groups or clusters such as families or physician practices. When observations are correlated, multivariable models that incorporate correlated observations are needed. Common choices are generalized estimating equations and mixed-effects models. Generalized estimating equations are population-averaged models; they estimate the mean difference between the groups. This is in contrast to mixed-effects models, which estimate subject-specific differences. Conditional logistic regression is useful with a dichotomous outcome that is measured repeatedly. The Andersen-Gill formulation of the proportional hazards model is useful for censored data with outcomes that can occur more than once per subject over time.
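A sketch contrasting the two common approaches, assuming hypothetical repeated measures clustered within subjects (`subject_id`) and an interval outcome `score` in a DataFrame `df`.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Population-averaged model: generalized estimating equations
gee = smf.gee("score ~ visit + treatment", groups="subject_id", data=df,
              cov_struct=sm.cov_struct.Exchangeable()).fit()

# Subject-specific model: mixed effects with a random intercept per subject
mixed = smf.mixedlm("score ~ visit + treatment", data=df,
                    groups="subject_id").fit()

print(gee.params)    # mean differences averaged over the population
print(mixed.params)  # differences conditional on subject-level effects
```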
Assessing the underlying assumptions of multivariable models enables us to improve their fit, but it is a complicated process that is more art than science. The basic tool for assessing the fit of a model is the residual: the difference between the observed and the estimated value. Residuals can be thought of as the error in estimation. There are a number of possible transformations of the residuals for different multivariable procedures. For proportional hazards analysis it is important to test the proportionality assumption. This can be done using a log-minus-log survival plot, Schoenfeld’s residuals, division of time into discrete intervals, or time-dependent covariates.
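A sketch of the Schoenfeld-residual approach using the lifelines package; the column names are hypothetical, and `surv` is assumed to contain only the duration, event indicator, and covariates.

```python
from lifelines import CoxPHFitter
from lifelines.statistics import proportional_hazard_test

surv = df[["followup_days", "event", "age", "treatment"]]
cph = CoxPHFitter()
cph.fit(surv, duration_col="followup_days", event_col="event")

# Schoenfeld-residual test: small p-values flag covariates whose effect
# changes over time, violating the proportionality assumption
results = proportional_hazard_test(cph, surv, time_transform="rank")
results.print_summary()
```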
An emulation (or target) trial uses observational data to simulate a randomized trial. Because there is no actual randomization, multivariable methods are needed to adjust for differences between groups. However, emulation trials improve on traditional observational studies by following all the same steps as a randomized trial with the exception of randomization. With an emulation trial, before conducting the data analysis, specify the research question, eligibility criteria, determination of treatment groups, start of study and end of follow-up, outcome, and analysis plan. Active comparators can minimize indication bias. By aligning eligibility, treatment assignment, and start of follow-up, emulation trials minimize immortal time bias.
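A minimal sketch of pre-specifying these protocol elements as a plain configuration before touching the data; every entry is a hypothetical example, not from the text.

```python
# Hypothetical emulation-trial protocol, fixed before any analysis
protocol = {
    "research_question": "Does drug A reduce 1-year mortality vs. drug B?",
    "eligibility": "adults newly prescribed drug A or drug B, no prior use",
    "treatment_groups": ["drug A", "drug B (active comparator)"],
    "time_zero": "date of first prescription",  # start of follow-up
    "end_of_follow_up": "1 year after time zero",
    "outcome": "all-cause mortality",
    "analysis_plan": "proportional hazards with propensity-score adjustment",
}
```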
Classification and regression trees (CART): a technique for separating subjects into distinct subgroups based on a dichotomous outcome. Its major advantage over multiple logistic regression is that it more closely reflects how clinicians make decisions: certain pieces of information take you down a particular diagnostic path, where you gather more information to prove or disprove that you are on the right path. Most clinicians do not total up a weighted version of all the information and then make a decision.
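A sketch of CART with scikit-learn, assuming a hypothetical DataFrame `df` with a dichotomous outcome `died`; the printed tree mirrors the sequential, branching way clinicians gather information.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = df[["age", "sbp", "creatinine"]]          # hypothetical predictors
tree = DecisionTreeClassifier(max_depth=3).fit(X, df["died"])

# Print the decision paths: each branch is a yes/no question about one variable
print(export_text(tree, feature_names=list(X.columns)))
```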
Multivariable analysis is needed because most events, whether medical, political, social, or personal, have multiple causes, and these causes are related to one another. Multivariable analysis enables us to determine the relative contributions of different causes to a single event or outcome.
Multivariable analysis enables us to identify and adjust for confounders. Confounders are associated with the risk factor and causally related to the outcome. Adjustment for confounders is key to distinguishing important etiologic risk factors from variables that only appear to be associated with outcomes due to their association with the true risk factor.
Stratification can also be used for identifying independent relationships between risk factors and outcomes but becomes too cumbersome when there are more than one or two possible confounders.
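A sketch of stratified analysis with a single confounder, using statsmodels’ Mantel-Haenszel machinery; the counts are invented, and `tables` holds one 2x2 exposure-by-outcome table per stratum of the confounder. Each extra confounder multiplies the number of strata, which is why the approach quickly becomes cumbersome.

```python
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

tables = [np.array([[20, 80], [10, 90]]),   # stratum 1: e.g., men
          np.array([[30, 70], [25, 75]])]   # stratum 2: e.g., women
st = StratifiedTable(tables)
print(st.oddsratio_pooled)                  # confounder-adjusted odds ratio
```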
In setting up your model, include, in addition to the risk factor or group assignment, those variables that have been theorized or shown in prior research to be confounders, or those that are empirically associated with both the risk factor and the outcome in bivariate analysis.
Exclude variables that are on the intervening pathway between the risk factor and outcome, those that are extraneous because they are not on the causal pathway, redundant variables, and variables with a lot of missing data.
Sample size calculation for multivariable analysis is complicated, but statistical programs exist to help you calculate it. Missing data on independent variables can compromise your multivariable analysis. Several methods exist to compensate for missing independent data, including deleting cases, using indicator variables to represent missing data, and estimating the value of missing cases. Methods also exist for estimating missing outcome data, using other data you have about the subject as well as multiple imputation.
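A sketch of two of these strategies for missing independent data with scikit-learn; the column names are hypothetical, and the single model-based imputation shown stands in for full multiple imputation, which would repeat the estimation several times.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Indicator-variable approach: flag which subjects are missing the value
df["bmi_missing"] = df["bmi"].isna().astype(int)

# Estimation approach: impute each variable from the other variables
cols = ["age", "bmi", "sbp"]
df[cols] = IterativeImputer(random_state=0).fit_transform(df[cols])
```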
A published report should include a sufficient explanation of the statistical methods so that someone with access to the original data could reproduce the reported results. Generally, it is best to divide the methods section of your paper into how subjects were enrolled (Subjects), what interventions were used or how data were acquired (Procedures), how the variables were coded (Measures), and how the data were analyzed (Statistical analysis). Unless there is no missing data, it is important to report the n for each analysis.
What results to report in your paper will vary based on your research question, your analysis, and the style of the journal. In general, for multiple linear regression models, report the regression coefficients, the standard errors of the coefficients, and the statistical significance levels of the coefficients. For logistic regression, report the odds ratio and the 95% confidence interval; for proportional hazards regression, report the relative hazard and the 95% confidence interval.
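A sketch of extracting the report-ready quantities from a fitted logistic regression (the hypothetical statsmodels `model` from earlier): odds ratios and their 95% confidence intervals come from exponentiating the coefficients and their interval bounds.

```python
import numpy as np

odds_ratios = np.exp(model.params)
conf_int = np.exp(model.conf_int())   # 95% CI on the odds-ratio scale
print(odds_ratios)
print(conf_int)
```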