Overview
When we make model by data science, machine learning method, it’s not simple process such as “just throw data into SVM”. Getting, checking, processing, modeling, evaluation. There are many steps you need to care about.
On this article, by making regression model on R, I’ll show the example of part of the process.
Check the data
When we make model by data, we at first need to check the data themselves. The features can have Nan values, correlation with each features, and so on. It’s necessary to check those characteristics by statistical values and visualization.
I used airquality data set, which R has as default.
Simply, we can check basic statistical information by following command on R.
summary(airquality)
The output is as below. We can see the statistical data of each feature and the existense of Nan. The function, summary(), can literally show summary of the data. When you want to check the data, this is useful.
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
NA's :37 NA's :7
Month Day
Min. :5.000 Min. : 1.0
1st Qu.:6.000 1st Qu.: 8.0
Median :7.000 Median :16.0
Mean :6.993 Mean :15.8
3rd Qu.:8.000 3rd Qu.:23.0
Max. :9.000 Max. :31.0
Although the output above is informative, it’s not so easy to understand the data at a glance. By making histogram of each feature, the distributions of each feature are shown.
attach(airquality)
par(mfrow=c(4,1))
hist(Ozone)
hist(Solar.R)
hist(Wind)
hist(Temp)
By visualization, we can know the distributions at a glance.
Next, we should check the relations between each feature. In R, cor() function returns correlation coefficients between features.
cor(airquality[,1:4], use="complete.obs")
Ozone Solar.R Wind Temp
Ozone 1.0000000 0.3483417 -0.6124966 0.6985414
Solar.R 0.3483417 1.0000000 -0.1271835 0.2940876
Wind -0.6124966 -0.1271835 1.0000000 -0.4971897
Temp 0.6985414 0.2940876 -0.4971897 1.0000000
To check the relations between features, scatter diagram is also effective.
pairs(airquality[,1:4])
On this case, some correlations, (Ozone, Wind) and (Ozone, Temp), can be observed.
Make model
After chicking data, we can tackle with making model. On this article, I used linear regression.
At first, let’s make the simplest model.
airquality.lm <- lm(Ozone~Temp+Solar.R+Wind, data=airquality)
The function, summary(), can be used here too.
summary(airquality.lm)
Call:
lm(formula = Ozone ~ Temp + Solar.R + Wind, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-40.485 -14.219 -3.551 10.097 95.619
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -64.34208 23.05472 -2.791 0.00623 **
Temp 1.65209 0.25353 6.516 2.42e-09 ***
Solar.R 0.05982 0.02319 2.580 0.01124 *
Wind -3.33359 0.65441 -5.094 1.52e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 21.18 on 107 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16
We can get diagnostic plots by following.
plot(airquality.lm)
The model above doesn’t think about mutual interaction between features. So, let’s make model with thinking about that.
airquality.interect.lm <- lm(Ozone~(Temp+Solar.R+Wind)^2, data=airquality)
summary(airquality.interect.lm)
Call:
lm(formula = Ozone ~ (Temp + Solar.R + Wind)^2, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-38.685 -11.727 -2.169 7.360 91.244
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.408e+02 6.419e+01 -2.193 0.03056 *
Temp 2.322e+00 8.330e-01 2.788 0.00631 **
Solar.R -2.260e-01 2.107e-01 -1.073 0.28591
Wind 1.055e+01 4.290e+00 2.460 0.01555 *
Temp:Solar.R 5.061e-03 2.445e-03 2.070 0.04089 *
Temp:Wind -1.613e-01 5.896e-02 -2.735 0.00733 **
Solar.R:Wind -7.231e-03 6.688e-03 -1.081 0.28212
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 19.17 on 104 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6863, Adjusted R-squared: 0.6682
F-statistic: 37.93 on 6 and 104 DF, p-value: < 2.2e-16
The coefficient of determination improved from 0.6059 to 0.6863(adjusted coefficient of determination also improved).
But some explaining variables have big Pr(>|t|). For better model, we think to choose the variables.
One of the methods to choose variables is to use AIC.
aic <- step(airquality.interect.lm)
print(aic)
Start: AIC=662.37
Ozone ~ (Temp + Solar.R + Wind)^2
Df Sum of Sq RSS AIC
- Solar.R:Wind 1 429.42 38635 661.61
<none> 38205 662.37
- Temp:Solar.R 1 1574.75 39780 664.86
- Temp:Wind 1 2748.20 40954 668.08
Step: AIC=661.61
Ozone ~ Temp + Solar.R + Wind + Temp:Solar.R + Temp:Wind
Df Sum of Sq RSS AIC
<none> 38635 661.61
- Temp:Solar.R 1 2141.1 40776 665.60
- Temp:Wind 1 4339.8 42975 671.43
Call:
lm(formula = Ozone ~ Temp + Solar.R + Wind + Temp:Solar.R + Temp:Wind,
data = airquality)
Coefficients:
(Intercept) Temp Solar.R Wind Temp:Solar.R
-1.368e+02 2.451e+00 -3.531e-01 1.115e+01 5.717e-03
Temp:Wind
-1.863e-01
From the final output, we can say Temp*Sorlar.R and Temp:Wind set should be added to the model and those coefficients are as above.
After making model, we usually check and evaluate that.