Saturday, September 30, 2017

How to deal with heteroscedasticity

In the article below, I wrote about heteroscedasticity.

What is heteroscedasticity and How to check it on R

Linear regression with OLS is a simple and powerful method for analyzing data. From the coefficients, we can see the influence each variable has. Although linear regression with OLS looks easy to use, since both the code and the mathematics are simple, it relies on some important assumptions that must hold to get proper coefficients and their nice properties.


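Roughly, with heteroscedasticity, OLS loses some of its nice features. The coefficient estimates stay unbiased, but they are no longer the most efficient ones, and the usual standard errors are biased, so t-tests and confidence intervals become unreliable. Plots and tests such as the Breusch-Pagan test reveal the existence of heteroscedasticity.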

Once we know the problem, of course we need to know how to solve it.
In this article, I’ll write about how to deal with heteroscedasticity.

I’ll use the same data here as in the article above.



Data


Just in case, let’s check the data.

x <- seq(1, 100)
y_heteroscedastic <- x * 5 + 100 + rnorm(100, x, 2 * x) 

In R, the code above generates the data. As x grows, the standard deviation of y_heteroscedastic grows with it.

# fit a simple linear regression by OLS and draw it over the data
line1 <- lm(y_heteroscedastic ~ x)
plot(x, y_heteroscedastic)
abline(line1, col = 'red')

With simple linear regression by OLS, we can draw the red line on the data plot.


On this artificial data, you can see the heteroscedasticity visually. But in many cases a visual check is not enough, so it is good practice to run a formal test as well.

library(lmtest)  # provides bptest()
bptest(line1)
studentized Breusch-Pagan test

data:  line1
BP = 18.76, df = 1, p-value = 1.483e-05

The p-value is far below 0.05, so the test rejects homoscedasticity and confirms that the data are heteroscedastic.
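To make the test less of a black box, here is a minimal sketch of what the studentized Breusch-Pagan statistic computes, using only base R: regress the squared OLS residuals on the regressor, then multiply the R-squared of that auxiliary regression by the sample size. The data are regenerated with an arbitrary seed, and the names x_sim, y_sim, aux and bp_stat are my own for illustration, so the numbers will not match the output above.

```r
set.seed(42)  # arbitrary seed, only so that the sketch is repeatable
x_sim <- seq(1, 100)
y_sim <- x_sim * 5 + 100 + rnorm(100, x_sim, 2 * x_sim)
fit <- lm(y_sim ~ x_sim)

# auxiliary regression: squared residuals explained by the regressor
sq_resid <- resid(fit)^2
aux <- lm(sq_resid ~ x_sim)

# studentized BP statistic = n * R^2 of the auxiliary regression
n <- length(x_sim)
bp_stat <- n * summary(aux)$r.squared

# under homoscedasticity it is roughly chi-squared, df = number of regressors
bp_pvalue <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(BP = bp_stat, p_value = bp_pvalue))
```

If lmtest is installed, bptest(fit) should report the same statistic, because its default is the studentized version.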

How to deal with heteroscedasticity?


There are several methods to deal with heteroscedasticity.
When we deal with it, we need to select the method that fits the characteristics of the data. Concretely, we should think about the reason why the variable is heteroscedastic.

But first, let’s check a simple method.

Log-transformation


As a simple remedy, log-transformation is one candidate. When log() is applied to numbers, the difference between big and small values becomes relatively small.

So, when the standard deviation grows as the value grows, the transformation can suppress those differences.

But it has a restriction: log() accepts only positive numbers. If the variable can take zero or negative values, we should not use this method.
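The two points above can be checked in a few lines. This is only a small sketch: the vector values is made up, and can_log() is a hypothetical helper written for this post, not a function from any package.

```r
values <- c(1, 10, 100, 1000)

diff(values)       # gaps on the original scale: 9, 90, 900
diff(log(values))  # on the log scale every gap equals log(10)

# log(0) is -Inf and log() of a negative number is NaN (with a warning),
# so check the input before transforming it
can_log <- function(v) all(is.finite(v)) && all(v > 0)

can_log(values)       # TRUE
can_log(c(-1, 2, 3))  # FALSE
```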

Anyway, let’s try log-transformation.

log_x <- log(x)
log_y_heteroscedastic <- log(y_heteroscedastic)

log_line <- lm(log_y_heteroscedastic~log_x)

This code fits the log-transformed model.
Does this solve the heteroscedasticity? The Breusch-Pagan test shows the outcome.

bptest(log_line)
studentized Breusch-Pagan test

data:  log_line
BP = 0.028403, df = 1, p-value = 0.8662

The BP statistic is almost 0 and the p-value is large, so the test does not detect heteroscedasticity in the log-transformed model.

Although the difference between the plot of the log-transformed data and the plot of the original data is not very intuitive, the transformed plot is also consistent with homoscedasticity.

plot(log_x, log_y_heteroscedastic)
abline(log_line, col='red')
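As a rough numeric companion to the two plots, the sketch below compares the residual spread in the upper half of x with the spread in the lower half, for the original model and for the log-transformed one. This is an informal check and not a formal test. The data are regenerated with an arbitrary seed, and spread_ratio() is a hypothetical helper introduced only for this illustration.

```r
set.seed(42)  # arbitrary seed
d <- data.frame(x = seq(1, 100))
d$y <- d$x * 5 + 100 + rnorm(100, d$x, 2 * d$x)
d <- d[d$y > 0, ]  # keep only rows that log() can handle

raw_fit <- lm(y ~ x, data = d)
log_fit <- lm(log(y) ~ log(x), data = d)

# ratio of residual standard deviations: upper half of x over lower half
spread_ratio <- function(fit, x) {
  r <- resid(fit)
  upper <- x > median(x)
  sd(r[upper]) / sd(r[!upper])
}

ratio_raw <- spread_ratio(raw_fit, d$x)
ratio_log <- spread_ratio(log_fit, d$x)
print(c(raw = ratio_raw, log = ratio_log))
```

A ratio far above 1 means the residuals fan out as x grows, while a ratio near 1 suggests a similar spread in both halves.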

Heteroscedasticity structure


Log-transformation is one of the methods to deal with heteroscedasticity. However, when we know the structure of the heteroscedasticity, we can choose a better way.

Normally I should cover GLS here, but that would make this article long. So I’ll just introduce a rough approach for one specific case.

As one example of heteroscedastic data, think about the following case. The data are aggregated, meaning, for example, that the explanatory variable is the mean of some quantity within a group. In this case, the variance of the variable depends on the size of the group.

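In this kind of situation, one remedy for heteroscedasticity is to multiply each value by $\sqrt{n}$, where $n$ is the number of items in the group. The variance of a group mean is proportional to $1/n$, so this scaling makes the variance constant across groups.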

For example, when each data point is a U.S. state and the explanatory variable is the mean consumption per household in that state, multiplying each value by the square root of the number of households in the state gives homoscedasticity.
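Here is a minimal sketch of that scaling on simulated data. Everything in it is an assumption made for illustration: the group sizes, the true coefficients, and the setup where the response is a group mean whose noise has standard deviation sigma / sqrt(group_size). It multiplies every column, including the intercept column, by the square root of the group size, and compares the result with weighted least squares through the weights argument of lm().

```r
set.seed(1)  # arbitrary seed
n_groups <- 50
group_size <- sample(10:1000, n_groups, replace = TRUE)
x_mean <- runif(n_groups, 0, 10)

# a group mean is noisier when the group is small
sigma <- 5
y_mean <- 2 + 3 * x_mean + rnorm(n_groups, 0, sigma / sqrt(group_size))

# multiply everything by sqrt(group size); the intercept column becomes w
w <- sqrt(group_size)
scaled_fit <- lm(I(w * y_mean) ~ 0 + w + I(w * x_mean))

# same estimate, expressed as weighted least squares
wls_fit <- lm(y_mean ~ x_mean, weights = group_size)

print(unname(coef(scaled_fit)))
print(unname(coef(wls_fit)))
```

The two fits give the same coefficients, so in practice passing the group sizes as weights is usually the more convenient way to write it.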