Regression Models with Log-Transformed Variables in R
I read through the page below, implemented the R code for the analysis, and plotted the linear and nonlinear fits for personal study.
Source
UCLA | FAQ: How do I interpret a regression model when some variables are log transformed?
Read data
Read the data from the source page.
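A minimal sketch of the import step, assuming the dataset linked from the UCLA page (the exact URL is my assumption and may differ from the one on the source page):

```r
# Read the example data; the URL is assumed from the UCLA FAQ page.
dat <- read.csv("https://stats.idre.ucla.edu/stat/data/lgtrans.csv")
str(dat)  # expect columns such as write, female, read, math
```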
Outcome variable log transformed
$ log(y_{i}) = \beta_{0} + \beta_{1}x_{1i} + \cdots + \beta_{k}x_{ki} + e_{i} $
Intercept-only model
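A sketch of the intercept-only fit, assuming the data frame `dat` from the import step above:

```r
# Intercept-only model for the log-transformed outcome.
m0 <- lm(log(write) ~ 1, data = dat)
summary(m0)       # intercept should be about 3.95
exp(coef(m0)[1])  # exp(3.94835) = 51.84974, the geometric mean of write
```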
$ log(write) = \beta_{0} = 3.95 $
“3.95 is the conditional mean of the log of write,” and its exponentiated value, exp(3.94835) = 51.84974, is the geometric mean of write.
- Arithmetic mean: the sum of a series of numbers divided by the count of numbers in the series.
- $ (x+y)/2 $
- Geometric mean: the nth root of the product of the numbers in the series, which captures their compounding effect.
- $ (xy)^{1/2} $
- Harmonic mean: often used to calculate the average of ratios or rates. It is the most appropriate measure for these because it equalizes the weights of the data points, whereas the arithmetic mean places a high weight on large data points.
- $ 2/\big(\frac{1}{x}+\frac{1}{y}\big) $
Ref. What Is The Difference Between Arithmetic Mean And Geometric Mean?
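A quick check of the three definitions in base R (the numbers are arbitrary):

```r
x <- c(4, 9)
mean(x)                  # arithmetic mean: (4 + 9) / 2 = 6.5
exp(mean(log(x)))        # geometric mean: (4 * 9)^(1/2) = 6
length(x) / sum(1 / x)   # harmonic mean: 2 / (1/4 + 1/9) ~ 5.54
```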
A model with a single binary predictor variable
$ \begin{align}
log(write)
&= \beta_{0} + \beta_{1} \times female \\
&= 3.89 + .10 \times female
\end{align} $
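A sketch of the fit, assuming `female` is a 0/1 indicator in `dat` (convert first if it is stored as a factor):

```r
# Log outcome regressed on a single binary predictor.
m1 <- lm(log(write) ~ female, data = dat)
coef(m1)  # intercept ~ 3.89, female coefficient ~ .10
```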
a single binary predictor (write by female)
a single binary predictor (log write by female)
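A rough way to reproduce the two plots above with ggplot2; boxplots are my choice, and the original figures may have used a different geom:

```r
library(ggplot2)

# Raw writing scores by gender.
ggplot(dat, aes(x = factor(female), y = write)) +
  geom_boxplot() +
  labs(x = "female", y = "write")

# Log-transformed writing scores by gender.
ggplot(dat, aes(x = factor(female), y = log(write))) +
  geom_boxplot() +
  labs(x = "female", y = "log(write)")
```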
The intercept of 3.89 is the log of the geometric mean of write when female = 0, i.e., for males. Therefore, its exponentiated value is the geometric mean for the male group: exp(3.892) = 49.01.
Calculate the geometric means manually.
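A sketch of the manual calculation, again assuming `female` is coded 0/1:

```r
# Geometric mean of write for each group via exp(mean(log(x))).
exp(mean(log(dat$write[dat$female == 0])))  # male group: ~ 49.01
exp(mean(log(dat$write[dat$female == 1])))  # female group: ~ 54.35
```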
On the log scale, the coefficient of .10 is the difference in the expected mean of log(write) between female students and male students.
On the original scale of the variable write, its exponentiated value is the ratio of “the geometric mean of write for female students” over “the geometric mean of write for male students”.
$ exp(.10326) = 54.3483 / 49.01222 = 1.108781 $
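The same ratio can be checked against the fitted model `m1` from above:

```r
# The exponentiated slope equals the ratio of the two geometric means.
exp(coef(m1)[2])    # exp(.10326) = 1.108781
54.3483 / 49.01222  # ratio of geometric means, same value
```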
Switching from male students to female students, we expect to see about an 11% increase in the geometric mean of writing scores.
Ref. Statistics Globe | geometric mean in R
A model with multiple predictor variables
$ \begin{align}
log(write)
&= \beta_{0} + \beta_{1} \times female + \beta_{2} \times read + \beta_{3} \times math \\
&= 3.135 + .115 \times female + .0066 \times read + .0077 \times math \end{align} $
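A sketch of the fit with several untransformed predictors, reusing `dat`:

```r
# Log outcome with multiple predictors.
m2 <- lm(log(write) ~ female + read + math, data = dat)
summary(m2)
exp(coef(m2))               # e.g. female: exp(.114718) = 1.121557
exp(coef(m2)["read"] * 10)  # ~ 1.0686, about a 6.9% increase per 10 units of read
```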
For the variable female_ind, $ exp(\beta_{1}) = exp(.114718) = 1.121557 $: writing scores will be about 12% higher for female students than for male students.
For the variable read, $ exp(\beta_{2}) = exp(.006631) = 1.006653 $: writing scores will be about 0.7% higher for each one-unit increase in reading score. For a ten-unit increase in read, we expect to see about $ exp(.006631 \times 10) = 1.0685526 $, i.e., roughly a $ 6.9\% $ increase in writing score.
When the outcome variable is log transformed, it is natural to interpret the exponentiated regression coefficients: these values correspond to changes in the ratio of the expected geometric means of the original outcome variable.
Some (not all) predictor variables are log transformed
$ \begin{align}
write
&= \beta_{0} + \beta_{1} \times female + \beta_{2} \times log(math) + \beta_{3} \times log(read) \\
&= -99.164 + 5.389 \times female + 20.941 \times log(math) + 16.852 \times log(read) \end{align} $
Mutate a column with log(read)
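A sketch using dplyr; the column name `lgread` is my choice:

```r
library(dplyr)

# Add a log-transformed copy of the reading score.
dat <- dat %>%
  mutate(lgread = log(read))
```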
Create regression model
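A sketch of the model, with `log(math)` applied inline and the mutated `lgread` column from above:

```r
# Untransformed outcome with two log-transformed predictors.
m3 <- lm(write ~ female + log(math) + lgread, data = dat)
summary(m3)  # expect roughly: intercept -99.164, female 5.389,
             # log(math) 20.941, lgread 16.852
```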
Since this is an OLS regression (ordinary least squares regression), the interpretation of the regression coefficients for the non-transformed variables (such as female) is unchanged from an OLS regression without any transformed variables. On the other hand, due to the log transformation, the estimated effects of math and read are no longer linear, even though the effects of log(math) and log(read) are linear.
Drawing a smooth line like the one in the source text
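One way to draw such a curve with ggplot2; the original figure may have been produced differently:

```r
# Scatter of write against read with a fitted y ~ log(x) line.
ggplot(dat, aes(x = read, y = write)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ log(x), se = FALSE)
```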
Using the calculated intercepts and a fixed value of math
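A sketch that builds the curve from the estimated coefficients directly, holding math at its median (my choice) and female at 0:

```r
# Predicted write over the range of read, with math and female held fixed.
b <- coef(m3)
read_grid <- seq(min(dat$read), max(dat$read), length.out = 100)
fixed_math <- median(dat$math)
pred_write <- b["(Intercept)"] + b["log(math)"] * log(fixed_math) +
  b["lgread"] * log(read_grid)
plot(dat$read, dat$write, xlab = "read", ylab = "write")
lines(read_grid, pred_write)
```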
Some notes from the text
- Taylor expansion of the function $ f(x) = log(1+x) $ around $ x_{0} = 0 $: $ log(1+x) = x + \mathcal{O}(x^{2}) $
- As long as the percent increase in the predictor variable is fixed, we will see the same difference in writing score, regardless of where the baseline reading score is. For example, we can say that for a $ 1\% $ increase in reading score, the difference in the expected mean writing scores will always be approximately $ \beta_{3} \times 0.01 = 16.85218 \times 0.01 = .1685218 $. Using the log, the exact value is $ \beta_{3} \times log(1.01) = 16.85218 \times log(1.01) = .1676848 $.
Both the outcome variable and some predictor variables are log transformed
$ \begin{align} log(write)
&= \beta_{0} + \beta_{1} \times female + \beta_{2} \times log(math) + \beta_{3} \times read \\
&= 1.928100 + .114240 \times female + .408537 \times log(math) + .006609 \times read
\end{align} $
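A sketch of this last model, reusing `dat`:

```r
# Log outcome with one log-transformed and one untransformed predictor.
m4 <- lm(log(write) ~ female + log(math) + read, data = dat)
summary(m4)  # expect roughly: intercept 1.928100, female .114240,
             # log(math) .408537, read .006609
```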
Some notes from the source text.
- For variables that are not transformed, such as female, the exponentiated coefficient is the ratio of the geometric mean of write for the female student group to the geometric mean for the male student group. In our example, we can say that the expected percent increase in the geometric mean from the male group to the female group is about $ 12\% $, holding the other variables constant, since $ exp(.114240) \approx 1.12 $.
- For reading score, we can say that for a one-unit increase in reading score, we expect to see about a $ 0.7\% $ increase in the geometric mean of writing score, since $ exp(.006609) = 1.007 $.
- For math, $ log(write(m_{2})) - log(write(m_{1})) = \beta_{2} \times [log(m_{2}) - log(m_{1})] $. This simplifies to $ log[write(m_{2})/write(m_{1})] = \beta_{2} \times log(m_{2}/m_{1}) $, leading to $ \frac{write(m_{2})}{write(m_{1})} = \big(\frac{m_{2}}{m_{1}}\big)^{\beta_{2}} $. For any fixed ratio $ \frac{m_{2}}{m_{1}} $, the expected ratio of the outcome variable write stays the same. For example, for a 10% increase in math score, the expected ratio of the writing scores will be $ (1.10)^{\beta_{2}} = (1.10)^{.408537} = 1.039706 $. In other words, we expect about a $ 4\% $ increase in writing score when the math score increases by 10%.