Friday, September 29, 2017

What is heteroscedasticity and how to check it in R

Linear regression with OLS is a simple and powerful method for analyzing data. From the coefficients, we can see how much influence each variable has.

Although OLS linear regression looks easy to use, because both the code and the mathematics it requires are simple, it comes with some important conditions that must hold for the coefficients and their desirable properties to be valid.

In this article, I'll show how to check for heteroscedasticity.



Why OLS?


When we analyze data, linear regression with OLS can be a very efficient method. It tells us how much influence each variable has, and because the model is simple, its form leads directly to interpretation.

The OLS estimator has the following four characteristics.
  • linearity
  • unbiasedness
  • efficiency
  • consistency

From Wikipedia
Unbiasedness
In statistics, the bias (or bias function) of an estimator is the difference between this estimator’s expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. Otherwise the estimator is said to be biased.

Efficiency
In the comparison of various statistical procedures, efficiency is a measure of quality of an estimator, of an experimental design, or of a hypothesis testing procedure. Essentially, a more efficient estimator, experiment, or test needs fewer observations than a less efficient one to achieve a given performance.

Consistency
In statistics, a consistent estimator or asymptotically consistent estimator is an estimator—a rule for computing estimates of a parameter θ0—having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to θ0.

These characteristics make OLS a powerful method, but they do not always hold. To obtain them, the three conditions below must be fulfilled.

The error term should be:
  • homoscedastic (written out in symbols below)
  • uncorrelated
  • independent of the explanatory variables
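To make the homoscedasticity condition concrete, here is the standard textbook formulation for the simple model used later in this article (my own addition, written in LaTeX):

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad
\operatorname{Var}(\varepsilon_i) = \sigma^2 \ \text{for all } i \quad \text{(homoscedasticity)}

Heteroscedasticity is simply the failure of this condition: the error variance \operatorname{Var}(\varepsilon_i) = \sigma_i^2 changes from observation to observation.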

What is heteroscedasticity?


Homoscedasticity matters when we want the OLS estimator to be efficient and its usual standard errors to be valid: the coefficient estimates remain unbiased even under heteroscedasticity, but the reported standard errors and tests become unreliable. Let's check this point with R code.

x <- seq(1, 100)
# The second argument of rnorm() is the mean, the third the standard deviation.
y_homogeneous <- x * 5 + 100 + rnorm(100, x, 5)          # constant sd: homoscedastic
y_heteroscedastic <- x * 5 + 100 + rnorm(100, x, 2 * x)  # sd grows with x: heteroscedastic

In the code above, rnorm() generates the error. The error term of y_homogeneous follows a Gaussian distribution with the parameters {mean: x, sd: 5}. Because the standard deviation is constant, we can say that the data with x and y_homogeneous fulfill the condition of homoscedasticity.

On the other hand, the error term of y_heteroscedastic is rnorm(100, x, 2 * x), meaning the standard deviation grows as x grows.
By plotting, we can grasp the data visually.

par(mfrow = c(1, 2))  # show the two scatter plots side by side
plot(x, y_homogeneous, col = 'blue')
plot(x, y_heteroscedastic, col = 'green')
[Figure: side-by-side scatter plots of y_homogeneous (left, blue) and y_heteroscedastic (right, green) against x]

How to check heteroscedasticity?


In the case above, we can spot the heteroscedasticity from the plot. But in many cases it is not so easy to tell whether heteroscedasticity exists.

To check for heteroscedasticity, we can use formal tests. In this article, I use the Breusch–Pagan test, whose null hypothesis is homoscedasticity, and the x and y_heteroscedastic data as the heteroscedastic example.

First, the code loads the libraries and fits a linear model.

library("zoo")
library("lmtest")

line1 <- lm(y_heteroscedastic~x)
summary(line1)
Call:
lm(formula = y_heteroscedastic ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-284.12  -57.82   -3.74   46.60  337.88 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 108.5503    22.3559   4.856 4.54e-06 ***
x             6.0149     0.3843  15.650  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 110.9 on 98 degrees of freedom
Multiple R-squared:  0.7142,    Adjusted R-squared:  0.7113 
F-statistic: 244.9 on 1 and 98 DF,  p-value: < 2.2e-16

To check visually, we can plot the data and the fitted line.

plot(x, y_heteroscedastic)
abline(line1, col='red')
[Figure: scatter plot of x versus y_heteroscedastic with the fitted regression line in red]

In this case, we can see the heteroscedasticity. But we should not rely only on visual judgment.
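Another standard visual diagnostic (my addition, not in the original post) is to plot the residuals against the fitted values; under homoscedasticity the vertical spread should stay roughly constant, while here it fans out.

plot(fitted(line1), residuals(line1),
     xlab = 'Fitted values', ylab = 'Residuals')
abline(h = 0, col = 'red')  # reference line at zero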

In R, it is easy to run the test.

bptest(line1)
    studentized Breusch-Pagan test

data:  line1
BP = 19.84, df = 1, p-value = 8.421e-06

When the data are homoscedastic, the BP statistic stays small and the p-value is large, although the exact threshold depends on the significance level you choose. In this case the BP statistic is large and the p-value is smaller than 0.01, so we can say the data are heteroscedastic.
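To see what bptest() is doing under the hood, here is a minimal sketch of the mechanics of the studentized Breusch–Pagan test (my own illustration, not part of the original post): regress the squared residuals on x, then compare n times the R-squared of that auxiliary regression against a chi-squared distribution with one degree of freedom.

u2 <- residuals(line1)^2                     # squared OLS residuals
aux <- lm(u2 ~ x)                            # auxiliary regression on x
bp <- length(u2) * summary(aux)$r.squared    # BP statistic = n * R^2
pchisq(bp, df = 1, lower.tail = FALSE)       # p-value; should match bptest(line1)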

Of course, when we observe heteroscedasticity, there are several ways to deal with it; I write about them in the next article.
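As a small preview (this uses the sandwich package, which the original post does not), one common choice is to keep the OLS coefficients but compute heteroscedasticity-consistent standard errors:

library("sandwich")  # for vcovHC()

# Re-test the coefficients with heteroscedasticity-consistent (HC1) standard
# errors; the estimates are unchanged, only the standard errors are corrected.
coeftest(line1, vcov. = vcovHC(line1, type = "HC1"))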
