Wednesday, September 20, 2017

Making linear regression model by R

Overview


When we make model by data science, machine learning method, it’s not simple process such as “just throw data into SVM”. Getting, checking, processing, modeling, evaluation. There are many steps you need to care about.

On this article, by making regression model on R, I’ll show the example of part of the process.




Check the data


When we make model by data, we at first need to check the data themselves. The features can have Nan values, correlation with each features, and so on. It’s necessary to check those characteristics by statistical values and visualization.

I used airquality data set, which R has as default.
Simply, we can check basic statistical information by following command on R.

summary(airquality)

The output is as below. We can see the statistical data of each feature and the existense of Nan. The function, summary(), can literally show summary of the data. When you want to check the data, this is useful.

     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  

Although the output above is informative, it’s not so easy to understand the data at a glance. By making histogram of each feature, the distributions of each feature are shown.

attach(airquality)
par(mfrow=c(4,1)) 
hist(Ozone)
hist(Solar.R)
hist(Wind)
hist(Temp)
enter image description here

By visualization, we can know the distributions at a glance.
Next, we should check the relations between each feature. In R, cor() function returns correlation coefficients between features.

cor(airquality[,1:4], use="complete.obs")
    Ozone   Solar.R Wind    Temp
Ozone   1.0000000   0.3483417   -0.6124966  0.6985414
Solar.R 0.3483417   1.0000000   -0.1271835  0.2940876
Wind    -0.6124966  -0.1271835  1.0000000   -0.4971897
Temp    0.6985414   0.2940876   -0.4971897  1.0000000

To check the relations between features, scatter diagram is also effective.

pairs(airquality[,1:4])
enter image description here

On this case, some correlations, (Ozone, Wind) and (Ozone, Temp), can be observed.

Make model


After chicking data, we can tackle with making model. On this article, I used linear regression.
At first, let’s make the simplest model.

airquality.lm <- lm(Ozone~Temp+Solar.R+Wind, data=airquality)

The function, summary(), can be used here too.

summary(airquality.lm)
Call:
lm(formula = Ozone ~ Temp + Solar.R + Wind, data = airquality)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.485 -14.219  -3.551  10.097  95.619 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -64.34208   23.05472  -2.791  0.00623 ** 
Temp          1.65209    0.25353   6.516 2.42e-09 ***
Solar.R       0.05982    0.02319   2.580  0.01124 *  
Wind         -3.33359    0.65441  -5.094 1.52e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 21.18 on 107 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.5948 
F-statistic: 54.83 on 3 and 107 DF,  p-value: < 2.2e-16

We can get diagnostic plots by following.

plot(airquality.lm)
enter image description here

The model above doesn’t think about mutual interaction between features. So, let’s make model with thinking about that.

airquality.interect.lm <- lm(Ozone~(Temp+Solar.R+Wind)^2, data=airquality)
summary(airquality.interect.lm)
Call:
lm(formula = Ozone ~ (Temp + Solar.R + Wind)^2, data = airquality)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.685 -11.727  -2.169   7.360  91.244 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)  -1.408e+02  6.419e+01  -2.193  0.03056 * 
Temp          2.322e+00  8.330e-01   2.788  0.00631 **
Solar.R      -2.260e-01  2.107e-01  -1.073  0.28591   
Wind          1.055e+01  4.290e+00   2.460  0.01555 * 
Temp:Solar.R  5.061e-03  2.445e-03   2.070  0.04089 * 
Temp:Wind    -1.613e-01  5.896e-02  -2.735  0.00733 **
Solar.R:Wind -7.231e-03  6.688e-03  -1.081  0.28212   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.17 on 104 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.6863,    Adjusted R-squared:  0.6682 
F-statistic: 37.93 on 6 and 104 DF,  p-value: < 2.2e-16

The coefficient of determination improved from 0.6059 to 0.6863(adjusted coefficient of determination also improved).
But some explaining variables have big Pr(>|t|). For better model, we think to choose the variables.
One of the methods to choose variables is to use AIC.

aic <- step(airquality.interect.lm)
print(aic)
Start:  AIC=662.37
Ozone ~ (Temp + Solar.R + Wind)^2

               Df Sum of Sq   RSS    AIC
- Solar.R:Wind  1    429.42 38635 661.61
<none>                      38205 662.37
- Temp:Solar.R  1   1574.75 39780 664.86
- Temp:Wind     1   2748.20 40954 668.08

Step:  AIC=661.61
Ozone ~ Temp + Solar.R + Wind + Temp:Solar.R + Temp:Wind

               Df Sum of Sq   RSS    AIC
<none>                      38635 661.61
- Temp:Solar.R  1    2141.1 40776 665.60
- Temp:Wind     1    4339.8 42975 671.43

Call:
lm(formula = Ozone ~ Temp + Solar.R + Wind + Temp:Solar.R + Temp:Wind, 
    data = airquality)

Coefficients:
 (Intercept)          Temp       Solar.R          Wind  Temp:Solar.R  
  -1.368e+02     2.451e+00    -3.531e-01     1.115e+01     5.717e-03  
   Temp:Wind  
  -1.863e-01  

From the final output, we can say Temp*Sorlar.R and Temp:Wind set should be added to the model and those coefficients are as above.
After making model, we usually check and evaluate that.