Tuesday, October 10, 2017

Bayesian modeling of data with heteroscedasticity using Stan

Previously, I wrote about data with heteroscedasticity.


What is heteroscedasticity and How to check it on R

Linear regression with OLS is a simple and strong method for analyzing data. From the coefficients, we can know the influence each variable has. Although linear regression with OLS looks easy to use because of its simplicity in terms of code and mathematics, it has some important conditions that must be met to obtain proper coefficients and characteristics.

How to deal with heteroscedasticity

In the articles above, I wrote about heteroscedasticity and how to check for it.
This time, I'll build the model again, but with Python and Stan.



Data

For the details of heteroscedasticity and the data, please check the articles above. Concisely, with the code below I generated data whose variance becomes bigger as x becomes bigger.

import numpy as np

x = list(range(1, 100))

y_homogeneous = []
y_heteroscedastic = []
for i in x:
    # constant noise scale: homoscedastic
    y_homogeneous.append(i * 100 + 100 + np.random.normal(i, 100))

    # noise scale grows with i: heteroscedastic
    err = np.random.normal(i, i * 50)
    # clip the error so that y stays positive
    if err <= -(i * 100 + 100):
        err = -(i * 100 + 100 - 1)

    y_heteroscedastic.append(i * 100 + 100 + err)

To check what the data look like, we can visualize them.

import matplotlib.pyplot as plt
plt.subplot(1, 2, 1)
plt.scatter(x, y_homogeneous)
plt.subplot(1, 2, 2)
plt.scatter(x, y_heteroscedastic)
plt.show()

[Figure: scatter plots of the homogeneous (left) and heteroscedastic (right) data]

The left plot shows the data (x, y_homogeneous), which has no heteroscedasticity. The right plot shows the data with heteroscedasticity: as x becomes bigger, the variance becomes bigger.
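
The growing spread can also be checked numerically rather than by eye. As a rough sketch (my own addition, not part of the original analysis), we can fit an OLS line with np.polyfit and compare the residual spread on the lower and upper halves of x; for heteroscedastic data the upper half is clearly wider.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 100)
# same generating process as above, without the clipping for brevity
y_het = x * 100 + 100 + rng.normal(x, x * 50)

# fit an OLS line and compute the residuals
b, a = np.polyfit(x, y_het, 1)
resid = y_het - (a + b * x)

# compare the residual spread on the lower vs. upper half of x
lo, hi = resid[x < 50].std(), resid[x >= 50].std()
print(hi / lo)  # clearly above 1 for heteroscedastic data
```

For homogeneous data, the same ratio stays close to 1, which gives a quick numerical companion to the scatter plots.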

Stan modeling


When you use Stan, you need to write the model and save it to a file. In this case, I wrote the model below and saved it as heteroscedastic.stan.

data {
    int N;
    real X[N];
    real Y[N];
}

parameters {
    real a;
    real b;
    real<lower=0> sigma;
}

model {
    for (i in 1:N)
        Y[i] ~ normal(a + b * X[i], sigma);
}
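
For intuition, this model block is plain linear regression with a single global noise level sigma; it does not yet model the heteroscedasticity itself. As a sanity check (my own sketch, not from the original post), the same likelihood can be maximized in plain NumPy: the maximum-likelihood a and b are the OLS estimates, and the maximum-likelihood sigma is the residual standard deviation.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(1, 100)
# homoscedastic toy data: y = 100 * x + 100 plus constant-scale noise
y = 100 * x + 100 + rng.normal(0, 100, size=x.size)

# MLE for Y[i] ~ normal(a + b * X[i], sigma)
b_hat, a_hat = np.polyfit(x, y, 1)           # OLS slope and intercept
sigma_hat = np.std(y - (a_hat + b_hat * x))  # residual standard deviation

print(a_hat, b_hat, sigma_hat)  # all three roughly near 100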

Execute

To get sampled points, we can use PyStan from Python.
First, I ran the sampling on the data as it is.

import pystan
data = {'X': list(x), 'Y': y_heteroscedastic, 'N': len(x)}
fit = pystan.stan(file='heteroscedastic.stan', data=data)

The fitted object contains the summary.

print(fit)
Inference for Stan model: anon_model_e182d2238e6672fa6ba3455e960016c7.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
a     184.46    7.29 318.67 -463.6 -27.99 182.04  402.9  803.5   1913    1.0
b     114.56    0.13    5.6 103.88  110.8 114.46 118.28 125.69   1869    1.0
sigma 1554.5    2.49 116.37 1350.7 1474.5 1546.1 1625.5 1805.1   2182    1.0
lp__  -768.9    0.04   1.29 -772.2 -769.5 -768.6 -768.0 -767.4   1216    1.0

Samples were drawn using NUTS at Mon Oct  9 22:54:08 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
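
For readers unfamiliar with the diagnostics in that footer: Rhat compares the between-chain and within-chain variances, and should be close to 1 when the chains have mixed. Below is a minimal sketch of the split-chain version (my own simplified implementation, not PyStan's exact code).

```python
import numpy as np

def split_rhat(chains):
    """Potential scale reduction factor on split chains.

    chains: array of shape (n_chains, n_draws) for one parameter.
    """
    n_chains, n_draws = chains.shape
    # split each chain in half, as Stan does
    half = n_draws // 2
    split = chains[:, : 2 * half].reshape(n_chains * 2, half)
    n = split.shape[1]
    chain_means = split.mean(axis=1)
    W = split.var(axis=1, ddof=1).mean()  # within-chain variance
    B = n * chain_means.var(ddof=1)       # between-chain variance
    var_hat = (n - 1) / n * W + B / n     # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
good = rng.normal(0, 1, size=(4, 1000))  # four chains targeting the same distribution
bad = good + np.arange(4)[:, None]       # four chains stuck at different levels
print(split_rhat(good))  # close to 1.0
print(split_rhat(bad))   # well above 1.1
```

When chains disagree about the location of the posterior, the between-chain variance inflates the pooled estimate and Rhat rises above 1.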

The distributions of the sampled parameters look like this.

fit.plot()

[Figure: posterior distributions and trace plots from fit.plot()]

One of the methods to deal with heteroscedasticity is to use a log transformation. I used it here, as I did before.

import math

# log-transform both x and y to stabilize the variance
x_log = []
y_heteroscedastic_log = []
for x_val, y_val in zip(x, y_heteroscedastic):
    x_log.append(math.log(x_val))
    y_heteroscedastic_log.append(math.log(y_val))

data_log = {'X': x_log, 'Y': y_heteroscedastic_log, 'N': len(x)}
fit_log = pystan.stan(file='heteroscedastic.stan', data=data_log)

The summary of the log-transformed fit is below.

print(fit_log)
Inference for Stan model: anon_model_e182d2238e6672fa6ba3455e960016c7.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
a       5.06  2.2e-3   0.08    4.9   5.01   5.06   5.11   5.22   1244    1.0
b       0.92  6.0e-4   0.02   0.88   0.91   0.92   0.94   0.96   1253    1.0
sigma    0.2  3.4e-4   0.01   0.17   0.19    0.2   0.21   0.23   1752    1.0
lp__  108.94    0.03    1.2  105.8 108.37 109.27  109.8 110.28   1340    1.0

Samples were drawn using NUTS at Mon Oct  9 22:54:50 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

fit_log.plot()

[Figure: posterior distributions and trace plots from fit_log.plot()]

With Stan, we can build a Bayesian model for data with heteroscedasticity in the same manner.
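
As a closing aside (my own sketch, not from the original post), we can check numerically why the log transform stabilized the variance here: the noise scale grows roughly in proportion to the mean, so on the log scale the residual spread becomes roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 100)
y = x * 100 + 100 + rng.normal(x, x * 50)
mask = y > 0  # keep only positive values so the log is defined
x, y = x[mask], y[mask]

# residual spread ratio (upper half vs. lower half of x) on the raw scale
b, a = np.polyfit(x, y, 1)
r = y - (a + b * x)
ratio_raw = r[x >= 50].std() / r[x < 50].std()

# the same ratio on the log-log scale
x_log, y_log = np.log(x), np.log(y)
b_l, a_l = np.polyfit(x_log, y_log, 1)
r_l = y_log - (a_l + b_l * x_log)
ratio_log = r_l[x >= 50].std() / r_l[x < 50].std()

print(ratio_raw, ratio_log)  # the log-scale ratio is much closer to 1
```

This is exactly the property the single-sigma Stan model needs, which is why the log-transformed fit gave a much cleaner posterior.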