How to improve the substantive interpretation of regression results when the dependent variable is logged

Regression models with log-transformed dependent variables are widely used by social scientists to investigate nonlinear relationships between variables. Unfortunately, this transformation complicates the substantive interpretation of estimation results and often leads to incomplete and sometimes even misleading interpretations. We focus on one valuable but underused method, the presentation of quantities of interest such as expected values or first differences on the original scale of the dependent variable. The procedure to derive these quantities differs in seemingly minor but critical aspects from the well-known procedure based on standard linear models. To improve empirical practice, we explain the underlying problem and develop guidelines that help researchers derive meaningful interpretations from regression results of models with log-transformed dependent variables.

Regression models with log-transformed dependent variables are widely used by social scientists to investigate nonlinear relationships between variables. Unfortunately, this transformation complicates the substantive interpretation of the respective estimation results. In an effort to improve empirical practice, we clarify one popular strategy for the substantive interpretation of such regression results: the presentation of quantities of interest such as predicted values, expected values, or first differences on the original scale of the dependent variable (King et al., 2000). We show that calculating such quantities together with their associated uncertainty is different from well-known procedures that work in the case of linear regression models without log-transformed dependent variables. Ignoring this difference can lead to erroneous communication of regression results when the dependent variable is log-transformed.
The key point of confusion is this: a regression with a logged dependent variable estimates E[ln(y|X)]. For a substantive interpretation we want to calculate quantities of interest and their associated uncertainty on the original scale of the dependent variable rather than the logged scale. However, scholars cannot simply exponentiate expected values, standard errors, or upper and lower bounds of the estimated confidence intervals on the logged scale in order to transform them to the original scale. While ln(y|X) is normally distributed, its transformation y|X back to the original scale is skewed. The consequence is that e^E[ln(y|X)] ≠ e^ln(E[y|X]) = E[y|X]. This is well known among methodologists (e.g., Manning, 1998), but often neglected by substantive scholars. To derive the desired quantities on the original scale together with the associated uncertainty, scholars need to carefully apply appropriate transformation formulas and simulate the respective confidence intervals correctly.
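The inequality e^E[ln(y|X)] ≠ E[y|X] is easy to verify numerically. The following Python sketch (the paper's own code is in R; the parameter values here are arbitrary) simulates a log-normal y and compares the naive back-transform with the true mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# If ln(y) ~ N(mu, sigma^2), then y is log-normal with
# median e^mu and mean e^(mu + sigma^2/2).
mu, sigma = 2.0, 1.0
y = np.exp(rng.normal(mu, sigma, size=1_000_000))

naive = np.exp(np.mean(np.log(y)))   # e^E[ln y]: close to e^2, the median
truth = np.mean(y)                   # E[y]:      close to e^2.5

print(naive, truth)  # the naive back-transform understates the mean
```

With sigma = 1, the naive back-transform misses the mean by a factor of roughly e^(1/2) ≈ 1.65, which is exactly the gap between median and mean of the log-normal.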
Popular methods textbooks also acknowledge this "retransformation problem" (e.g., Cameron and Trivedi, 2022: Section 4.2.3; see also Cameron and Trivedi, 2005: Section 20.5.2). In this research note, we add two important aspects. Cameron and Trivedi (2022) declare that the prediction e^E[ln(y|X)] is a "very poor" and incorrect prediction for E[y|X]. We show that e^E[ln(y|X)] is indeed not a valid prediction for the mean on the original scale. However, it is still an interesting quantity, as it represents the conditional median of the log-normal distribution of y|X. Furthermore, the solutions described in Cameron and Trivedi (2022) focus on the point estimate of the conditional mean, but offer little guidance on the uncertainty that comes with those estimates. Here, we integrate solutions to the retransformation problem of point estimates with the simulation algorithm by King et al. (2000). In doing so, our solution also provides accurate confidence intervals for E[y|X] and other quantities of interest on the original scale of the dependent variable.
Even though the presentation of meaningful quantities of interest has become best practice for the interpretation of a wide range of statistical models, our review of current practice shows that substantive scholars make little use of this approach when interpreting results from regression models with logged dependent variables. We base this conclusion on a content analysis of all research articles published in the American Political Science Review and American Journal of Political Science between 2015 and 2020. In total, we identify 39 articles in which scholars report at least one statistical model with a log-transformed dependent variable.
We identify three main styles of interpretation.¹ First, an "old school" strategy that uses the mere direction (positive/negative) and statistical significance of regression coefficients for interpretation (in 6 out of 39 articles). These articles provide no substantive interpretation of the respective results and their uncertainty. Second, the most popular practice is the interpretation of regression coefficients as a "percent increase" of the dependent variable (used in 31 out of 39 articles). While this practice is not incorrect, we see an important shortcoming: a "percent increase" interpretation does not provide an adequate sense of the effect's absolute magnitude. Furthermore, scholars rarely present the uncertainty associated with a "percent increase." Finally, there is the interpretation of regression results through the presentation of quantities of interest such as predicted values, expected values, or first differences (in 13 out of 39 articles). This is the approach we recommend, if done correctly and effectively. Surprisingly few authors provide uncertainty assessments for their quantities of interest (only 3 of 13 are presented with uncertainty estimates). Because there is no reason why scholars should not be interested in communicating uncertainty on the original scale of the dependent variable, we interpret this result as stemming from a lack of guidance on how to correctly derive quantities of interest together with appropriate confidence intervals on the original scale when the dependent variable is log-transformed.
In this research note, we provide guidance on how substantive scholars can improve their interpretation of regression results when the dependent variable is logged. We show how to calculate quantities of interest on the original scale even when the dependent variable is log-transformed, and how to derive the respective confidence intervals using simulations. Furthermore, we highlight how the nonlinear nature of the log-transformation has important consequences for the calculation of first differences, and how the presentation of first differences is especially useful for the interpretation of estimation results from models that include interaction terms. We illustrate the utility of our approach with a reanalysis of a recent study on executive appointment processes in the USA.

Calculating quantities of interest when the dependent variable is logged
To calculate quantities of interest on the original scale for models with logged dependent variables, the workflow is to log-transform the dependent variable y, estimate the regression model with ln(y) as the dependent variable, and then use the estimation results to calculate quantities of interest, such as E[ln(y)], although these are rarely meaningful substantively. To get substantively meaningful quantities, one needs to transform those quantities back to the variable's original scale to get E[y]. While the transformation in the first step is fairly simple (we take the natural log of each value of y), the back transformation requires careful thinking.

1 The categories are not mutually exclusive. A list of all articles can be found in the Supporting information (SI.1).
The back transformation is not straightforward because it maps ln(y) back to y, which, conditional on the model, follows a skewed log-normal distribution. For y = e^ln(y) = e^(Xβ + ε) one can show that E[y] = e^(E[ln(y)] + (1/2)σ²) ≠ e^E[ln(y)] (e.g., Manning, 1998). Therefore, we cannot simply exponentiate E[ln(y)] to obtain E[y]. Table 1 provides an overview of the correct transformation formulas: if we are interested in E[y], we need to transform the estimates on the log scale with E[y] = e^(E[ln(y)] + (1/2)σ̂²). But e^E[ln(y)] is an interesting quantity as well, as it represents the median of the resulting log-normal distribution. Both the mean and the median of y can be interesting and reasonable quantities to present, but researchers must be aware of the difference and should not confuse one with the other when interpreting results. For unimodal, continuous skewed distributions, such as the log-normally distributed y, the median is often considered to be a more typical value than the mean (von Hippel, 2005).
To develop a deeper intuition and to illustrate the consequences of choosing the mean or the median as a typical value in the context of linear models, we walk through a motivating example introduced by Rainey (2017). Consider the following data generating process (DGP):

ln(Income) = β_cons + β_edu Education + ε, with ε ~ N(0, σ²)  (1)

The true values of the coefficients are given by β_cons = 2.5, β_edu = 0.1, and σ² = 1. The challenge with this DGP is that the dependent variable is log-transformed. Scholars, however, are usually interested in interpreting the results on the original scale of the dependent variable: we are interested in the results in dollars rather than ln(dollars). Suppose that we are interested in a typical income for a person with 20 years of education given our model. How can we calculate such a typical value of income in dollars, even though the dependent variable is income in ln(dollars)? First, we need to choose whether we want to present the median or the mean as our typical value. Both can be interesting. For a point estimate of the median income conditional on our scenario of 20 years of education, we get Med(Y_c) = e^(2.5 + 0.1×20) = e^4.5 ≈ 90.02. For a point estimate of the mean income conditional on 20 years of education, we get E(Y_c) = e^(2.5 + 0.1×20 + (1/2)×1) = e^5 ≈ 148.41. This shows that the mean and the median are two very distinct quantities.²

Table 1. Transformation formulas for point estimates of common quantities of interest, and their approximated distributions to construct correct confidence intervals by using the α/2 and 1 − α/2 percentiles

Quantity of interest | Point estimate | Simulated confidence interval: use Q̃ at α/2 and 1 − α/2
Median Med(Y_c) | e^(X_c β̂) | Q̃ = e^(X_c β̃)
Expected value E(Y_c) | e^(X_c β̂ + (1/2)σ̂²) | Q̃ = e^(X_c β̃ + (1/2)σ̃²)
First difference of medians | e^(X_c1 β̂) − e^(X_c0 β̂) | Q̃ = e^(X_c1 β̃) − e^(X_c0 β̃)
First difference of means | e^(X_c1 β̂ + (1/2)σ̂²) − e^(X_c0 β̂ + (1/2)σ̂²) | Q̃ = e^(X_c1 β̃ + (1/2)σ̃²) − e^(X_c0 β̃ + (1/2)σ̃²)

2 The sampling distributions of these estimators for Med(Y_c) = e^(X_c β̂) and E(Y_c) = e^(X_c β̂ + (1/2)σ̂²) are right skewed as well, and thus the estimators are biased. Rainey (2017) describes this bias as transformation-induced bias.
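Applying the transformation formulas to the Rainey (2017) example takes one line each. A minimal Python sketch (using the DGP's true parameter values in place of estimates):

```python
import numpy as np

# Point estimates on the original (dollar) scale for the paper's DGP:
# ln(Income) = 2.5 + 0.1*Education + e,  e ~ N(0, 1), at Education = 20.
b_cons, b_edu, sigma2 = 2.5, 0.1, 1.0
xb = b_cons + b_edu * 20                   # E[ln(y)] for the scenario: 4.5

median_income = np.exp(xb)                 # e^(X_c b):           e^4.5 ~ 90.02
mean_income   = np.exp(xb + 0.5 * sigma2)  # e^(X_c b + s^2/2):   e^5   ~ 148.41

print(round(median_income, 2), round(mean_income, 2))
```

Note that only the mean formula involves the error variance; forgetting the (1/2)σ² term silently turns a mean into a median.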

Simulating confidence intervals when the dependent variable is logged
Every estimation entails uncertainty. Transparent communication of this uncertainty is fundamental to scientific practice. In this section, we show how the simulation approach by King et al. (2000) can be used to obtain confidence intervals for both the median and the mean of y on the original scale. Figure 1 provides the algorithm to simulate confidence intervals for the mean when the dependent variable is not transformed (left column), for the median when the dependent variable is log-transformed (center column), and for the mean when the dependent variable is log-transformed (right column). Consider the workflow for models with untransformed dependent variables first.
The procedure consists of four steps. In step 1 we approximate the distributions of the estimated coefficients to account for estimation uncertainty.³
We draw the simulated coefficients β̃ from MVN(β̂, σ̃²(X′X)⁻¹). In step 2 we choose our scenario of interest, i.e., we specify covariate values X_c that are held constant during the simulation. In step 3 we calculate X_c β̃, the linear combination of the simulations β̃ and the chosen values of the covariates (X_c) that define the scenario of interest. This results in a simulated distribution Ẽ(Y_c) of expected values of Y conditional on the specified scenario X_c. In step 4 we get a (1 − α) × 100% confidence interval by summarizing the distribution with the α/2 and 1 − α/2 percentiles.

Now consider the center column in Figure 1, where we outline the simulation procedure for a confidence interval for the conditional median of a model with a logged dependent variable. Using the example in Equation 1, suppose that we are interested in the median income in dollars of a person with 20 years of education. Steps 1-3 do not differ from the standard procedure: we draw simulations of the model coefficients (β̃ and σ̃²), we set Education = 20, and compute the linear combination of the simulated coefficients and the scenario of interest. This yields Ẽ(ln(Y_c)), a simulated distribution of ln(Income) conditional on 20 years of education. To get confidence intervals for the median in dollars, we exponentiate the distribution of Ẽ(ln(Y_c)), as shown in the transformation step in Figure 1. Step 4 then is the same as before. Note that the resulting distribution is skewed, so the confidence intervals will not be symmetric around the point estimate from the previous section.⁴

In the right column of Figure 1, we show how to get confidence intervals for the mean, e.g., the mean income in dollars of a person with 20 years of education. The procedure mostly remains the same, with the exception of step 3, where we need to add (1/2)σ̃² to X_c β̃. The derivation of confidence intervals for first differences follows equivalently (see Table 1 for an overview).⁵
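The four-step algorithm can be sketched in a few lines. This Python version (the paper's implementation, simloglm, is in R; data and sample sizes here are made up) simulates data from the DGP in Equation 1, fits OLS on the logged scale, and computes percentile confidence intervals for the conditional median and mean at Education = 20:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data from the DGP: ln(income) = 2.5 + 0.1*edu + e, e ~ N(0, 1)
n = 5000
edu = rng.uniform(8, 22, n)
X = np.column_stack([np.ones(n), edu])
ln_y = 2.5 + 0.1 * edu + rng.normal(0, 1, n)

# OLS fit on the logged scale
b_hat, *_ = np.linalg.lstsq(X, ln_y, rcond=None)
resid = ln_y - X @ b_hat
k = X.shape[1]
s2_hat = resid @ resid / (n - k)

# Step 1: estimation uncertainty.
# sigma^2 ~ Inv-Gamma((n-k)/2, s2_hat*(n-k)/2), drawn via the reciprocal of a Gamma;
# then beta ~ MVN(b_hat, sigma^2 * (X'X)^-1).
n_sims = 10_000
s2_tilde = 1.0 / rng.gamma((n - k) / 2, 2.0 / (s2_hat * (n - k)), n_sims)
cov = np.linalg.inv(X.T @ X)
b_tilde = np.array([rng.multivariate_normal(b_hat, s2 * cov) for s2 in s2_tilde])

# Step 2: scenario of interest (Education = 20)
x_c = np.array([1.0, 20.0])

# Step 3: linear combination on the logged scale, then transform
lin = b_tilde @ x_c
median_sims = np.exp(lin)                 # conditional median of y
mean_sims = np.exp(lin + 0.5 * s2_tilde)  # conditional mean of y

# Step 4: 95% confidence intervals from the 2.5 and 97.5 percentiles
print(np.percentile(median_sims, [2.5, 97.5]))
print(np.percentile(mean_sims, [2.5, 97.5]))
```

Both intervals are asymmetric around their point estimates because the exponentiated simulation distributions are right skewed.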

First-differences and interaction terms with log-transformed dependent variables
We point out two additional issues that arise when presenting quantities of interest on the original scale from regression models with log-transformed dependent variables.First, the magnitude of a first difference based on regression models with log-transformed dependent variables depends on all covariates in the scenario, even those that are held constant.This is different for regular linear regression models where the point estimate of the first difference does not depend on the values of covariates that are held constant across scenarios.
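A small numeric illustration of the first point, with made-up coefficients for ln(y) = β₀ + β₁d + β₂x + ε: because the first difference on the original scale is a difference of exponentials, the held-constant control x enters multiplicatively rather than cancelling out.

```python
import numpy as np

# Made-up coefficients for ln(y) = b0 + b1*d + b2*x + e
b0, b1, b2 = 1.0, 0.5, 0.2

def fd_median(d1, d0, x):
    # First difference of the conditional median between d = d1 and d = d0,
    # holding the control x constant within the scenario.
    return np.exp(b0 + b1 * d1 + b2 * x) - np.exp(b0 + b1 * d0 + b2 * x)

fd_low  = fd_median(1, 0, x=0)   # control held at a low value
fd_high = fd_median(1, 0, x=5)   # control held at a high value

print(fd_low, fd_high)  # same coefficients, different first differences
```

In a linear model the x terms would cancel in the subtraction; here they scale the whole difference by e^(b2*x).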
Second, if an interaction term is present in the regression model, e.g., ln(y) = β₀ + β₁D + β₂X + β₃DX + ε, then the first difference E(y|D = 1, X) − E(y|D = 0, X) can increase at higher levels of X even if β₃ is negative (and vice versa). This is different from regular linear models, where the first difference is given by the marginal effect of D, i.e., β₁ + β₃X, and changes as a linear function of X at a rate of β₃. Figure 2 demonstrates this notion.
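This can be checked with made-up coefficients: with β₃ < 0 the marginal effect on the log scale, β₁ + β₃X, shrinks in X, yet the first difference of the conditional median on the original scale, e^(β₀ + β₂X)(e^(β₁ + β₃X) − 1), can still grow in X.

```python
import numpy as np

# Made-up coefficients for ln(y) = b0 + b1*d + b2*x + b3*d*x,
# with a NEGATIVE interaction b3.
b0, b1, b2, b3 = 0.0, 1.0, 0.5, -0.1

def fd(x):
    # First difference of the conditional median at covariate value x:
    # e^(b0 + b2*x) * (e^(b1 + b3*x) - 1)
    return np.exp(b0 + b2 * x) * (np.exp(b1 + b3 * x) - 1)

log_me_0, log_me_5 = b1 + b3 * 0, b1 + b3 * 5   # 1.0 vs 0.5: decreasing in x
fd_0, fd_5 = fd(0), fd(5)                        # ~1.72 vs ~7.90: increasing in x

print(log_me_0 > log_me_5, fd_0 < fd_5)  # True True
```

The baseline level e^(b0 + b2*x) grows fast enough in x to outweigh the shrinking multiplicative effect of d.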
For applied work, these peculiarities show the benefit of presenting quantities of interest from regression models with log-transformed dependent variables on the original scale. At the same time, they alert researchers to choose and justify all values in their scenarios carefully. A careless selection of any of a model's covariate values may artificially inflate or deflate the size of a first difference.

Application-a reanalysis of Hollibaugh and Rothenberg (2018)
To demonstrate the utility of our approach, we reanalyze the results of a recently published study by Hollibaugh and Rothenberg (2018). The study investigates factors that influence executive appointment processes in the US context. The authors, among other things, study the relation between agency independence and appointee ideology. We are specifically interested in one of their hypotheses: the higher the independence of the decision-maker in the targeted agency, the higher the ideological divergence between the president and a nominee.⁶ To test this hypothesis, Hollibaugh and Rothenberg (2018) estimate linear models. Their dependent variable is the natural log of the ideological divergence between a nominee and the president (Nominee-President Divergence). The key independent variable is Agency Decision Maker Independence. In support of the hypothesis, Hollibaugh and Rothenberg (2018) find that Agency Decision Maker Independence is positively associated with the divergence between the president and the nominee.

4 This is also the reason why taking the average of the simulations is not a good strategy to get a point estimate. As the distribution is right skewed, continuous, and unimodal, the mean of this distribution will be biased upward.
To facilitate the interpretation of this effect, Hollibaugh and Rothenberg (2018) report "expected values" of ideological divergence between the president and the nominee from low to high agency decision-maker independence. All binary variables are set to zero and, following an average case approach, all continuous covariates to their means. We replicate the analysis following our guidelines and present both conditional mean and median values of the dependent variable in Figure 3.⁷ Our reanalysis reveals two things: first, researchers are not always aware of the difference between conditional mean and conditional median values of y. What Hollibaugh and Rothenberg (2018) present as expected values are actually conditional median values. Second, this difference is not trivial. Figure 3 demonstrates that the conditional mean and median of y are two very distinct quantities: the conditional mean values are considerably larger than the conditional median values.

6 This relates to hypothesis 6 in the original article. We refer interested readers to the original article for more details on the theoretical arguments behind this hypothesis.

Next, we illustrate how the selection of covariate values matters for first differences. The right panel of Figure 3 presents two first differences. Both estimates show a first difference of the median between the minimum and maximum values of agency decision-maker independence, but we alternated the values of the covariates that are held constant. One first difference is based on an average case scenario (as in the left panel of Figure 3); for the other scenario we set all these covariates to either their minimum or maximum. This would not affect the magnitude of the first difference in regular linear models, but it clearly affects the magnitude in this case, where the dependent variable is log-transformed. The first difference amounts to 0.075 in the average case setting, but roughly doubles to 0.156 if we fix the control variables at more extreme values.

Conclusion
We have shown how to apply appropriate transformation formulas to the estimated coefficients of linear regression models with logged dependent variables in order to derive various quantities of interest on the original scale, and how to derive the respective confidence intervals using simulations. We conclude with a set of four recommendations that researchers should keep in mind when interpreting such models.
First, it makes a difference whether conditional mean or median values are presented. Unless there is a special theoretical interest in only one of the two quantities, our advice is to present both the conditional mean and the conditional median. Second, point estimates of conditional mean and median values should be calculated directly from the point estimates of the regression model using the appropriate transformation formulas (see Table 1). The simulation method in combination with the same formulas allows researchers to derive the respective confidence intervals. Third, even values that are held constant across simulations, typically the values of control variables, influence quantities of interest on the original scale. These values have to be chosen and communicated transparently. Typical strategies are to set those variables to their means, their medians, or to observed values (Hanmer and Kalkan, 2013). Fourth, if the model includes one or more interaction terms, researchers should refrain from interpreting marginal effects on the logged scale. Larger marginal effects on the logged scale do not necessarily reflect larger marginal effects on the original scale; in fact, the opposite may be true, as we have shown. To interpret such results, our advice is to always calculate first differences on the original scale of the dependent variable.

Software
A software implementation of the proposed method is available as an open-source R package, simloglm, at https://github.com/mneunhoe/simloglm. We provide a code example for using the package, as well as code for calculating all mentioned quantities of interest and confidence intervals by hand in R, in the Supporting information (SI.3).
Supplementary material. The supplementary material for this article can be found at https://doi.org/10.1017/psrm.2023.29. Replication material for this article can be obtained at https://doi.org/10.7910/DVN/KZWKT6.

3 We start by drawing σ̃² from an inverse-gamma distribution, Inv-Γ((N − k)/2, (σ̂²(N − k))/2), where σ̂² is the estimated variance, N is the number of observations, and k is the number of coefficients (β̂). Next, we approximate the distribution of β by drawing simulations β̃ from the multivariate normal distribution MVN(β̂, σ̃²(X′X)⁻¹).

Figure 1. Algorithm to simulate confidence intervals (King et al., 2000) when the dependent variable is untransformed (left column), for the median when the dependent variable is logged (center column), and for the mean when the dependent variable is logged (right column).

Figure 3. Conditional mean and median values of nominee-president ideological divergence, and first difference between minimum and maximum values of the independent variable, conditional on two different sets of covariate values.

7 Hollibaugh and Rothenberg (2018) z-transform the independent variables and transform simulated values to the empirical percentile scale of the dependent variable.