In the fields of econometrics and data science, we sometimes use log-transformed variables in linear regression. Usually, one of the advantages of linear regression is that we can easily interpret the outcome. But with a log transformation, how should we interpret the outcome?
Overview
In many cases, we adopt linear regression to analyze data because it lets us understand how influential each feature is.
So when we use it, we want the features to be as simple as possible, to keep the interpretation easy. If you transform the features, you need to adjust your interpretation accordingly.
Simple linear regression case
First, let's look at the simple regression case in R. Here, I'll use the cars data set.
print(head(cars))
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
We use the feature speed as the explaining variable and dist as the explained variable.
# Fit a simple linear regression of dist on speed
cars.lm <- lm(cars$dist ~ cars$speed)
# Plot the data with the fitted line
plot(cars$speed, cars$dist)
abline(cars.lm, lwd=2, col="red")
The red line shows the fitted relationship between speed and dist.
summary(cars.lm)
Call:
lm(formula = cars$dist ~ cars$speed)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The red line in the plot is dist = -17.5791 + 3.9324 × speed.
Simply, in this case, when the feature speed increases by 1, dist increases by 3.9324.
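As a quick sanity check, here is a minimal sketch using the cars.lm model fitted above: the fitted values at two speeds one unit apart differ by exactly the slope.
a <- coef(cars.lm)[1]        # intercept: -17.5791
b <- coef(cars.lm)[2]        # slope: 3.9324
(a + b * 11) - (a + b * 10)  # difference of fitted dist values is exactly b = 3.9324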
Linear regression with log-transformed variables
When we build a linear regression model with log-transformed variables, how should we interpret the estimated parameters?
There are a few standard patterns, which can be summarized in the table below. Here, b means the coefficient of the variable X.

Model | When X increases by... | Y changes by...
Y = a + bX | 1 unit | b units
log(Y) = a + bX | 1 unit | about 100·b %
Y = a + b·log(X) | 1% | about b/100 units
log(Y) = a + b·log(X) | 1% | about b %

In a sense, the table above explains everything.
For example, suppose we build a linear regression model with a log-transformed explained variable Y and the original explaining variable X, and it estimates a and b in log(Y) = a + bX. Then when X increases by 1, Y increases by approximately 100·b %.
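To make these patterns concrete, here is a minimal sketch fitting all four specifications to the cars data set (the names level.level and so on are mine, chosen for illustration):
# Four specifications of the same relationship between speed and dist
level.level <- lm(dist ~ speed, data = cars)            # dist changes by b units per unit of speed
log.level   <- lm(log(dist) ~ speed, data = cars)       # dist changes by ~100*b % per unit of speed
level.log   <- lm(dist ~ log(speed), data = cars)       # dist changes by ~b/100 units per 1% of speed
log.log     <- lm(log(dist) ~ log(speed), data = cars)  # dist changes by ~b % per 1% of speed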
In the case of the log-transformed explained variable on the cars data set, it looks like this.
# Regress log-transformed dist on speed (the log-level pattern)
cars.lm.log <- lm(log(cars$dist) ~ cars$speed)
summary(cars.lm.log)
Call:
lm(formula = log(cars$dist) ~ cars$speed)
Residuals:
Min 1Q Median 3Q Max
-1.46604 -0.20800 -0.01683 0.24080 1.01519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.67612 0.19614 8.546 3.34e-11 ***
cars$speed 0.12077 0.01206 10.015 2.41e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4463 on 48 degrees of freedom
Multiple R-squared: 0.6763, Adjusted R-squared: 0.6696
F-statistic: 100.3 on 1 and 48 DF, p-value: 2.413e-13
From the coefficient of cars$speed, we can see that as speed increases by 1, dist increases by about 12%.
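Strictly speaking, the 12% reading relies on the approximation e^b ≈ 1 + b for small b. A small sketch to get the exact figure from the fitted model:
b <- coef(cars.lm.log)[2]  # 0.12077
exp(b) - 1                 # 0.1284, so the exact increase is about 12.8% per unit of speed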
Related article
In the article below, I used log transformation.
How to deal with heteroscedasticity
In that article, I wrote about heteroscedasticity. Linear regression with OLS is a simple and strong method for analyzing data: from the coefficients, we can see the influence each variable has. Although linear regression with OLS looks easy to use, because the required code and mathematics are simple, it has some important conditions which must be met to obtain proper coefficients.