Regression Models with Log-Transformed Variables in R

I read through the page below, implemented the R code for the analysis, and plotted linear and nonlinear fits for personal study.

Source

UCLA | FAQ HOW DO I INTERPRET A REGRESSION MODEL WHEN SOME VARIABLES ARE LOG TRANSFORMED?

Read data

Read the data from here:

library(tidyverse)  # provides read_csv(), mutate(), and the pipe

lgtrans <- read_csv("lgtrans.csv")

Outcome variable log transformed

$ log(y_{i}) = \beta_{0} + \beta_{1}x_{1i} + \cdots + \beta_{k}x_{ki} + e_{i} $

intercept-only model

> lgwrite.lm <- lm(lgtrans$lgwrite ~ 1)
> summary(lgwrite.lm)

Call:
lm(formula = lgtrans$lgwrite ~ 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.51436 -0.12520  0.04064  0.14600  0.25635 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.94835    0.01369   288.4   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1936 on 199 degrees of freedom

$ log(write) = \beta_{0} = 3.95 $

The estimate 3.95 is the (unconditional) expected mean of the log of write, and its exponentiated value is exp(3.94835) = 51.84974.

51.84974 is the geometric mean of write (all three classical means are compared below).
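A quick check that the exponentiated intercept matches the geometric mean (a minimal sketch, reusing the lgwrite.lm model fit above):

# exp of the intercept equals the geometric mean of write
exp(coef(lgwrite.lm)[1])        # from the fitted model
exp(mean(log(lgtrans$write)))   # computed directly from the data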

  • Arithmetic mean: the sum of the numbers divided by their count.
    • $ (x+y)/2 $
  • Geometric mean: the nth root of the product of n numbers, capturing their compounding effect.
    • $ (xy)^{1/2} $
  • Harmonic mean: often used to average ratios or rates. It is the most appropriate measure for ratios and rates because it equalizes the weight of each data point; by contrast, the arithmetic mean places a high weight on large data points, while the geometric mean gives a lower weight to smaller data points.
    • $ 2/\big(\frac{1}{x}+\frac{1}{y}\big) $

Ref. What Is The Difference Between Arithmetic Mean And Geometric Mean?
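To make the definitions concrete, the three means of write can be computed directly in R (a minimal sketch):

x <- lgtrans$write
mean(x)                  # arithmetic mean
exp(mean(log(x)))        # geometric mean: exp of the mean of the logs
length(x) / sum(1 / x)   # harmonic mean: reciprocal of the mean of reciprocals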

a model with a single binary predictor variable

$ \begin{align} log(write) &= \beta_{0} + \beta_{1} * female \\
&= 3.89 + .10 * female \end{align} $

> lgtrans <- read_csv("lgtrans.csv", col_types=list(
  female = col_factor(c("male", "female")),
  read = col_double(),
  write = col_double(),
  math = col_double(),
  lgwrite = col_double(),
  lgmath = col_double()
))

> lgtrans <- lgtrans %>%
mutate(male_ind = ifelse(female == "male",1,0)) %>%
mutate(female_ind = ifelse(female == "female",1,0))

> show(lgtrans)
# A tibble: 200 × 8
   female  read write  math lgwrite lgmath male_ind female_ind
   <fct>  <dbl> <dbl> <dbl>   <dbl>  <dbl>    <dbl>      <dbl>
 1 male      57    52    41    3.95   3.71        1          0
 2 female    68    59    53    4.08   3.97        0          1
 3 male      44    33    54    3.50   3.99        1          0
 4 male      63    44    47    3.78   3.85        1          0
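As an aside, the manual indicator columns are not strictly required: lm() expands factor predictors into dummy variables automatically, so an equivalent model can be fit directly from the factor (a minimal sketch):

# 'female' is a factor with levels male/female, so R creates the
# female dummy itself; coefficients match the female_ind models below
lm(lgwrite ~ female, data = lgtrans)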

a single binary predictor (write by female)

> write_by_female.lm <- lm(lgtrans$write ~ lgtrans$female_ind, data=lgtrans)

> summary(write_by_female.lm)

Call:
lm(formula = lgtrans$write ~ lgtrans$female_ind, data = lgtrans)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.991  -6.371   1.879   7.009  16.879 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         50.1209     0.9628  52.057  < 2e-16 ***
lgtrans$female_ind   4.8699     1.3042   3.734 0.000246 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.185 on 198 degrees of freedom
Multiple R-squared:  0.06579,	Adjusted R-squared:  0.06107 
F-statistic: 13.94 on 1 and 198 DF,  p-value: 0.0002463
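With the untransformed outcome, the coefficients are differences in arithmetic means: the intercept (50.12) is the male mean, and intercept + slope (54.99) is the female mean. A quick check (a minimal sketch):

# group (arithmetic) means of write; compare with 50.1209 and 50.1209 + 4.8699
tapply(lgtrans$write, lgtrans$female, mean)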

a single binary predictor (log write by female)

> lgwrite_by_female.lm <- lm(lgtrans$lgwrite ~ lgtrans$female_ind, data=lgtrans)

> summary(lgwrite_by_female.lm)

Call:
lm(formula = lgtrans$lgwrite ~ lgtrans$female_ind, data = lgtrans)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.45808 -0.11363  0.04772  0.14780  0.31262 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         3.89207    0.01961 198.446  < 2e-16 ***
lgtrans$female_ind  0.10326    0.02657   3.887 0.000139 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1871 on 198 degrees of freedom
Multiple R-squared:  0.07089,	Adjusted R-squared:  0.0662 
F-statistic: 15.11 on 1 and 198 DF,  p-value: 0.0001385

The intercept of 3.89 is the log of the geometric mean of write when female = 0, i.e., for males. Therefore, its exponentiated value is the geometric mean for the male group: exp(3.89207) = 49.01.

Calculate the geometric means manually:

> male_df <- lgtrans[lgtrans$male_ind==1,]
> exp(mean(log(c(male_df$write))))
[1] 49.01222
> female_df <- lgtrans[lgtrans$female_ind==1,]
> exp(mean(log(c(female_df$write))))
[1] 54.34383

On the log scale, the coefficient .10326 is the difference in the expected means of log(write) between female students and male students.

On the original scale of write, it is the ratio of the geometric mean of write for female students to the geometric mean of write for male students.

$ exp(.10326) = 54.34383 / 49.01222 = 1.108781 $

> exp(.10326)
[1] 1.10878
> 54.34383/49.01222
[1] 1.108781

Switching from male students to female students, we expect to see about an 11% increase in the geometric mean of writing scores.

Ref. Statistics Globe | Geometric Mean in R

a model with multiple predictor variables

$ \begin{align} log(write) &= \beta_{0} + \beta_{1} \times female + \beta_{2} \times read + \beta_{3} \times math \\
&= 3.135 + .115 \times female + .0066 \times read + .0077 \times math \end{align} $

> lgwrite_by_predictors.lm <- lm(lgtrans$lgwrite ~ lgtrans$female_ind+lgtrans$read+lgtrans$math, data=lgtrans)
> summary(lgwrite_by_predictors.lm)

Call:
lm(formula = lgtrans$lgwrite ~ lgtrans$female_ind + lgtrans$read + 
    lgtrans$math, data = lgtrans)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.41745 -0.08146  0.01182  0.09408  0.28824 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        3.135243   0.059811  52.419  < 2e-16 ***
lgtrans$female_ind 0.114718   0.019534   5.873 1.81e-08 ***
lgtrans$read       0.006631   0.001269   5.225 4.43e-07 ***
lgtrans$math       0.007679   0.001387   5.535 9.88e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1374 on 196 degrees of freedom
Multiple R-squared:  0.5042,	Adjusted R-squared:  0.4966 
F-statistic: 66.44 on 3 and 196 DF,  p-value: < 2.2e-16

For the variable female_ind, $ exp(\beta_{1}) = exp(.114718) = 1.121557 $. Holding read and math constant, writing scores are expected to be about 12% higher for female students than for male students.

For the variable read, $ exp(\beta_{2}) = exp(.006631) = 1.006653 $. Holding the other variables constant, a one-unit increase in read raises the geometric mean of writing scores by about 0.7%. For a ten-unit increase in read, we expect about $ exp(.006631 \times 10) = 1.0685526 $, i.e., roughly a 6.9% increase in writing score.

When the outcome variable is log transformed, it is natural to interpret the exponentiated regression coefficients. These values correspond to changes in the ratio of the expected geometric means of the original outcome variable.
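The exponentiated coefficients can be read off the fitted model directly (a minimal sketch, reusing lgwrite_by_predictors.lm from above):

# ratios of geometric means implied by the fitted model
exp(coef(lgwrite_by_predictors.lm))

# ten-unit change in read: exponentiate 10 * beta
exp(10 * coef(lgwrite_by_predictors.lm)["lgtrans$read"])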

some (not all) predictor variables are log transformed

$ \begin{align} write &= \beta_{0} + \beta_{1} \times female + \beta_{2} \times log(math) + \beta_{3} \times log(read) \\
&= -99.164 + 5.389 \times female + 20.941 \times log(math) + 16.852 \times log(read) \end{align} $

Add a column with log(read):

lgtrans <- mutate(lgtrans, lgread = log(read))

Create the regression model

write_by_some_logged_predictors.lm <- lm(lgtrans$write ~ lgtrans$female_ind+lgtrans$lgmath+lgtrans$lgread, data=lgtrans)
summary(write_by_some_logged_predictors.lm)

Call:
lm(formula = lgtrans$write ~ lgtrans$female_ind + lgtrans$lgmath + 
    lgtrans$lgread, data = lgtrans)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.0244  -3.4964   0.4328   3.9987  13.9675 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -99.1640    10.8041  -9.178  < 2e-16 ***
lgtrans$female_ind   5.3888     0.9308   5.789 2.77e-08 ***
lgtrans$lgmath      20.9410     3.4309   6.104 5.45e-09 ***
lgtrans$lgread      16.8522     3.0634   5.501 1.17e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.547 on 196 degrees of freedom
Multiple R-squared:  0.5301,	Adjusted R-squared:  0.5229 
F-statistic:  73.7 on 3 and 196 DF,  p-value: < 2.2e-16

Since this is an OLS (ordinary least squares) regression, the interpretation of the coefficients of the non-transformed variables (such as female) is unchanged from an OLS regression without any transformed variables. On the other hand, due to the log transformation, the estimated effects of math and read on write are no longer linear, even though the effects of log(math) and log(read) are.

> lgtrans_female <- lgtrans[lgtrans$female_ind==1,]
> lin_reg <- lm(lgtrans_female$write ~ lgtrans_female$read)  # simple linear fit for comparison
> plot(lgtrans_female$read, lgtrans_female$write, xlab="reading score", ylab="writing score")
> abline(lin_reg)

[Figure: ../../..//images/omsa/week2/writing-reading_feamle.jpeg — writing vs. reading scores for female students, with the fitted line]

Drawing a smooth logarithmic curve like the one in the source text:

# Fit write = a*log(read) + b to the female subset by nonlinear least squares
fm <- nls(write ~ a*log(read) + b, start=c(a=1, b=35), trace=TRUE, data=lgtrans_female)
plot(lgtrans_female$read, lgtrans_female$write)
a1 <- coef(fm)[1]
b1 <- coef(fm)[2]
x <- 25:80
lines(x, a1*log(x) + b1)  # overlay the fitted logarithmic curve

[Figure: ../../../images/omsa/week2/nonlinear_regression_1.jpeg — scatter plot with the fitted logarithmic curve]

Using the intercept implied by the multiple-regression coefficients, with math fixed at its mean:

# Fix the intercept at the value implied by the multiple regression
# (-99.164 + 5.389*female + 20.941*log(math)), with female = 1 and math
# held at its mean, and estimate only the slope a on log(read)
fm2 <- nls(write ~ a*log(read) + (-99.164 + 5.389*1 + 20.941*log(mean(math))), start=c(a=1), trace=TRUE, data=lgtrans_female)
plot(lgtrans_female$read, lgtrans_female$write)
a2 <- coef(fm2)[1]
x <- 25:80
lines(x, a2*log(x) + (-99.164 + 5.389*1 + 20.941*log(mean(lgtrans_female$math))))

[Figure: ../../../images/omsa/week2/nonlinear_regression_2.jpeg — scatter plot with the logarithmic curve drawn from the fixed intercept]
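As an alternative to refitting with nls, the same curve can be drawn from the OLS model itself. This is a sketch under the assumption that the model is refit with a formula interface so that predict() accepts new data; the names fit, read_grid, and newdat are illustrative:

# refit with a formula so predict() can be given new data
fit <- lm(write ~ female_ind + lgmath + lgread, data = lgtrans)

read_grid <- 25:80
newdat <- data.frame(female_ind = 1,                           # female students
                     lgmath = log(mean(lgtrans_female$math)),  # math fixed at its mean
                     lgread = log(read_grid))

plot(lgtrans_female$read, lgtrans_female$write, xlab="reading score", ylab="writing score")
lines(read_grid, predict(fit, newdata = newdat))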

Some notes from the source text

  • Taylor expansion of the function $ f(x) = log(1+x) $ around $ x_{0} = 0 $: $ log(1+x) = x + \mathcal{O}(x^2) $
  • As long as the percent increase in the predictor variable is fixed, we will see the same difference in writing score, regardless of the baseline reading score. For example, for a 1% increase in reading score, the difference in the expected mean writing scores will always be approximately $ \beta_{3} \times 0.01 = 16.85218 \times 0.01 = .1685218 $. Using the log, the exact value is $ \beta_{3} \times log(1.01) = 16.85218 \times log(1.01) = .1676848 $ (see the quick check below).
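A quick numerical check of that approximation (a minimal sketch):

# difference in expected write for a 1% increase in read
b3 <- 16.85218
b3 * 0.01        # first-order (Taylor) approximation
b3 * log(1.01)   # exact value implied by the model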

Both the outcome variable and some predictor variables are log transformed

$ \begin{align} log(write) &= \beta_{0} + \beta_{1} \times female + \beta_{2} \times log(math) + \beta_{3} \times read \\
&= 1.928100 + .114240 \times female + .408537 \times log(math) + .006609 \times read \end{align} $

both_outcome_and_predictors_logged.lm <- lm(lgtrans$lgwrite ~ lgtrans$female_ind+lgtrans$lgmath+lgtrans$read, data=lgtrans)
summary(both_outcome_and_predictors_logged.lm)

Call:
lm(formula = lgtrans$lgwrite ~ lgtrans$female_ind + lgtrans$lgmath + 
    lgtrans$read, data = lgtrans)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.42341 -0.08417  0.01464  0.09446  0.29203 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.928100   0.246939   7.808 3.42e-13 ***
lgtrans$female_ind 0.114240   0.019471   5.867 1.86e-08 ***
lgtrans$lgmath     0.408537   0.072079   5.668 5.11e-08 ***
lgtrans$read       0.006609   0.001256   5.261 3.74e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1369 on 196 degrees of freedom
Multiple R-squared:  0.5074,	Adjusted R-squared:  0.4999 
F-statistic:  67.3 on 3 and 196 DF,  p-value: < 2.2e-16

Some notes from the source text.

  • For variables that are not transformed, such as female, the exponentiated coefficient is the ratio of the geometric mean of write for the female group to that for the male group. In our example, the expected increase in the geometric mean from the male group to the female group is about 12%, holding the other variables constant, since $ exp(.114240) \approx 1.12 $.
  • For reading score, for a one-unit increase in read we expect about a 0.7% increase in the geometric mean of writing score, since $ exp(.006609) = 1.007 $.
  • For math, $ log(write(m_{2})) - log(write(m_{1})) = \beta_{2} \times [log(m_{2}) - log(m_{1})] $. This simplifies to $ log[write(m_{2})/write(m_{1})] = \beta_{2} \times log(m_{2}/m_{1}) $, leading to $ \frac{write(m_{2})}{write(m_{1})} = \big(\frac{m_{2}}{m_{1}}\big)^{\beta_{2}} $. For any fixed ratio $ m_{2}/m_{1} $, the expected ratio of the outcome variable write stays the same. For example, for a 10% increase in math score, the expected ratio of writing scores is $ (1.10)^{\beta_{2}} = (1.10)^{.408537} = 1.039706 $; in other words, we expect about a 4% increase in writing score when the math score increases by 10% (verified below).
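A quick check of the power-law interpretation for math (a minimal sketch):

# expected multiplicative change in write for a 10% increase in math
b2 <- 0.408537
1.10 ^ b2   # ~ 1.0397, i.e., about a 4% increase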
