How to deal with heteroscedasticity
In a previous article, "What is heteroscedasticity and How to check it on R", I wrote about heteroscedasticity. Linear regression with OLS is a simple and strong method for analyzing data: from the coefficients, we can see how much influence each variable has. But although OLS regression looks easy to use, needing little code and mathematics, it has some important conditions that must hold to obtain proper coefficients and characteristics. This time, I'll build the model again, but with Python and Stan.
Data
For details about heteroscedasticity and the data, please check the article above. Concisely, the code below generates data whose variance grows as the value of x grows.
import numpy as np

x = range(1, 100)
y_homogeneous = []
y_heteroscedastic = []
for i in x:
    # homoscedastic: noise with a constant standard deviation of 100
    y_homogeneous.append(i * 100 + 100 + np.random.normal(i, 100))
    # heteroscedastic: noise whose standard deviation grows with i
    err = np.random.normal(i, i * 50)
    if err <= -(i * 100 + 100):
        err = -(i * 100 + 100 - 1)
    y_heteroscedastic.append(i * 100 + 100 + err)
To see what the data look like, we can visualize them.
import matplotlib.pyplot as plt
plt.subplot(1, 2, 1)
plt.scatter(x, y_homogeneous)
plt.subplot(1, 2, 2)
plt.scatter(x, y_heteroscedastic)
plt.show()
The left image shows the data (x, y_homogeneous), which do not have heteroscedasticity. The right image shows the data with heteroscedasticity: as x becomes bigger, the variance becomes bigger.
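Beyond eyeballing the scatter plots, one informal way to check this numerically (my own sketch, not from the original article) is to detrend the data with a least-squares line and compare the spread of the residuals in the lower and upper halves of x; for heteroscedastic data the two standard deviations diverge. Here I regenerate a simplified version of the heteroscedastic data, without the clamping step above:

```python
import numpy as np

np.random.seed(0)

# Simplified version of the heteroscedastic data from above
x = np.arange(1, 100)
y = np.array([i * 100 + 100 + np.random.normal(i, i * 50) for i in x])

# Detrend with a least-squares line, then compare residual spread
b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)

half = len(x) // 2
sd_low = resid[:half].std()    # spread of residuals for small x
sd_high = resid[half:].std()   # spread of residuals for large x
print(sd_low, sd_high)         # sd_high is clearly larger
```

A formal alternative would be a test such as Breusch-Pagan, but this split comparison is enough to see the pattern here.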
Stan modeling
When you use Stan, you need to write the model and save it to a file. In this case, I wrote the model below and saved it as heteroscedastic.stan.
data {
  int N;
  real X[N];
  real Y[N];
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  for (i in 1:N)
    Y[i] ~ normal(a + b * X[i], sigma);
}
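If you prefer to keep everything in one script, the same model code can be written out to heteroscedastic.stan from Python before sampling; a minimal sketch:

```python
# Write the Stan model above to a file from Python.
model_code = """
data {
  int N;
  real X[N];
  real Y[N];
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  for (i in 1:N)
    Y[i] ~ normal(a + b * X[i], sigma);
}
"""

with open('heteroscedastic.stan', 'w') as f:
    f.write(model_code)
```

PyStan also accepts the code directly through the model_code argument of pystan.stan, which skips the intermediate file entirely.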
Execute
To get sampled points, we can use PyStan from Python. First, I ran the sampling on the data as is.
import pystan
data = {'X': x, 'Y': y_heteroscedastic, 'N':len(x)}
fit = pystan.stan(file='heteroscedastic.stan', data=data)
The returned fit object holds the summary.
print(fit)
Inference for Stan model: anon_model_e182d2238e6672fa6ba3455e960016c7.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
a 184.46 7.29 318.67 -463.6 -27.99 182.04 402.9 803.5 1913 1.0
b 114.56 0.13 5.6 103.88 110.8 114.46 118.28 125.69 1869 1.0
sigma 1554.5 2.49 116.37 1350.7 1474.5 1546.1 1625.5 1805.1 2182 1.0
lp__ -768.9 0.04 1.29 -772.2 -769.5 -768.6 -768.0 -767.4 1216 1.0
Samples were drawn using NUTS at Mon Oct 9 22:54:08 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
The distributions of the sampled parameters can be plotted like below.
fit.plot()
One of the methods to deal with heteroscedasticity is to use a log transformation. I used it here, as I did before.
import math

x_log = []
y_heteroscedastic_log = []
for x_val, y_val in zip(x, y_heteroscedastic):
    x_log.append(math.log(x_val))
    y_heteroscedastic_log.append(math.log(y_val))
data_log = {'X': x_log, 'Y': y_heteroscedastic_log, 'N':len(x)}
fit_log = pystan.stan(file='heteroscedastic.stan', data=data_log)
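As a quick aside (my own sketch, not part of the original article), the reason the log transform helps: when the noise scale grows roughly in proportion to the mean of y, the spread of log(y) around its trend is approximately constant. A minimal numpy demonstration, assuming purely multiplicative noise:

```python
import numpy as np

np.random.seed(1)

x = np.arange(1, 100).astype(float)
mean = x * 100 + 100

# Multiplicative noise: the standard deviation of y grows with its mean
y = mean * np.exp(np.random.normal(0, 0.3, size=len(x)))

half = len(x) // 2
# On the raw scale, the spread around the trend grows with x ...
resid_raw = y - mean
# ... but on the log scale it is constant by construction
resid_log = np.log(y) - np.log(mean)

ratio_raw = resid_raw[half:].std() / resid_raw[:half].std()
ratio_log = resid_log[half:].std() / resid_log[:half].std()
print(ratio_raw, ratio_log)  # ratio_raw is well above 1, ratio_log is near 1
```

The data generated in this article are additive rather than exactly multiplicative, so the transform only approximately stabilizes the variance, but as the summary below shows, it works well enough.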
The summary is like below.
print(fit_log)
Inference for Stan model: anon_model_e182d2238e6672fa6ba3455e960016c7.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
a 5.06 2.2e-3 0.08 4.9 5.01 5.06 5.11 5.22 1244 1.0
b 0.92 6.0e-4 0.02 0.88 0.91 0.92 0.94 0.96 1253 1.0
sigma 0.2 3.4e-4 0.01 0.17 0.19 0.2 0.21 0.23 1752 1.0
lp__ 108.94 0.03 1.2 105.8 108.37 109.27 109.8 110.28 1340 1.0
Samples were drawn using NUTS at Mon Oct 9 22:54:50 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
fit_log.plot()
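Because the model was fitted on the log scale, its predictions need to be mapped back with exp. Using the posterior means from the summary above (a ≈ 5.06, b ≈ 0.92), the implied relationship on the original scale is roughly y ≈ exp(a) * x^b. A quick sanity check of my own against the true generating line y = 100x + 100:

```python
import math

a, b = 5.06, 0.92   # posterior means from the log-scale fit above
for x_val in [10, 50, 90]:
    y_pred = math.exp(a + b * math.log(x_val))  # back-transform to the original scale
    y_true = 100 * x_val + 100
    print(x_val, round(y_pred), y_true)
```

The back-transformed predictions land in the right ballpark of the true line, which suggests the log-scale fit captured the underlying trend.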
With Stan, we can build a Bayesian model for data with heteroscedasticity in the same manner.