Wednesday, October 11, 2017

Hierarchical Bayesian model by Stan: Struggling

I’ll try to fit a hierarchical Bayesian model to artificial data using Stan. Hierarchical Bayesian models let us write models with a high degree of freedom.

From Wikipedia,
Bayesian hierarchical modelling is a statistical model written in multiple levels (hierarchical form) that estimates the parameters of the posterior distribution using the Bayesian method.[1] The sub-models combine to form the hierarchical model, and the Bayes’ theorem is used to integrate them with the observed data, and account for all the uncertainty that is present. The result of this integration is the posterior distribution, also known as the updated probability estimate, as additional evidence on the prior distribution is acquired.


For a simpler model, please check the article below.

Simple Bayesian modeling by Stan

For Bayesian modeling, we can use several languages and tools: BUGS, PyMC, Stan. In that article, I made a simple regression model using Stan from Python. To try simple regression, I used the data set Speed and Stopping Distances of Cars, which R provides by default.

Data


This is just a trial, so I made simple artificial data by myself.

import numpy as np
np.random.seed(49)

# Three groups with different sample sizes: 99, 49, and 2 points.
x_1 = list(range(1, 100))
x_2 = list(range(1, 50))
x_3 = list(range(1, 3))

# Each group follows its own line, y = b * x + a, plus Gaussian noise.
y_1 = [np.random.normal(3.0 * i + 4, 4.0) for i in x_1]
y_2 = [np.random.normal(4.0 * i + 5, 4.0) for i in x_2]
y_3 = [np.random.normal(2.0 * i + 2, 20.0) for i in x_3]

By visualizing the data, we can see how it behaves.

import matplotlib.pyplot as plt

plt.plot(x_1, y_1)
plt.plot(x_2, y_2)
plt.plot(x_3, y_3)
plt.show()

Basically, this data consists of three groups of points. The three groups have different sample sizes, and their slopes and intercepts also differ. But as a premise, the underlying process generating y_ from x_ is the same.

X = x_1 + x_2 + x_3
Y = y_1 + y_2 + y_3

Concatenating these lists completes the data preparation.
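As a quick sanity check, the three groups contribute 99, 49, and 2 points, so both lists should have 150 elements:

# 99 + 49 + 2 = 150 observations in total
print(len(X), len(Y))  # 150 150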

Simple model


At first, I made a model that ignores the differences inside the data, meaning the model does not distinguish the three groups.
I wrote the following model in a Stan file and saved it as “normal.stan”.

data {
    int N;                  // number of observations
    int X[N];
    real Y[N];
}

parameters {
    real a;                 // intercept
    real b;                 // slope
    real<lower=0> sigma;    // observation noise
}

model {
    for (i in 1:N)
        Y[i] ~ normal(a + b * X[i], sigma);
}

In Python, the following code executes the sampling.

import pystan

data = {'X': X, 'Y': Y, 'N':len(X)}
fit = pystan.stan(file='normal.stan', data=data)
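pystan.stan compiles the model and runs the sampling in one call. If the same model will be run repeatedly, compiling once with StanModel avoids rebuilding the C++ code each time; a small equivalent sketch using the PyStan 2 API:

# Compile once, then reuse the compiled model for sampling.
sm = pystan.StanModel(file='normal.stan')
fit = sm.sampling(data=data, iter=2000, chains=4)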

By printing fit, we can check its summary.

print(fit)
Inference for Stan model: anon_model_4ccb91f7c1ba187d6bb0f518f9eb36cc.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
a      16.26    0.05   2.17  12.01  14.78  16.32  17.76  20.42   1908    1.0
b       2.91  9.8e-4   0.04   2.82   2.88   2.91   2.94   2.99   1965    1.0
sigma  14.96    0.02    0.9  13.28  14.35  14.92  15.52  16.82   2551    1.0
lp__  -476.9    0.03   1.26 -480.3 -477.5 -476.6 -476.0 -475.5   1676    1.0

Samples were drawn using NUTS at Tue Oct 10 23:55:52 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The summary shows no apparent problem: all the Rhat values are fine, and the slope, b, is clearly larger than zero.
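If you prefer to check convergence programmatically instead of reading the table, the Rhat column can be pulled out of the summary. A small sketch, assuming the PyStan 2 fit.summary() layout (a dict holding a 'summary' array and its 'summary_colnames'):

# Extract the Rhat column and check every parameter at once.
s = fit.summary()
rhat = s['summary'][:, list(s['summary_colnames']).index('Rhat')]
print((rhat < 1.1).all())  # True when all parameters look converged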

fit.plot()

The parameter plots also show no problem.
Next, I drew the estimated line over a scatter plot of (X, Y).

# The intercept and slope are the posterior means from the summary above.
x = np.arange(1, 100, 0.01)
y = 16.26 + 2.91 * x

plt.scatter(X, Y)
plt.plot(x, y, 'r')
plt.show()
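Hard-coding the printed means works, but the line can also be drawn directly from the posterior draws; a small sketch using PyStan 2’s extract():

# Use the means of the merged post-warmup draws instead of typed-in values.
samples = fit.extract(permuted=True)
plt.scatter(X, Y)
plt.plot(x, samples['a'].mean() + samples['b'].mean() * x, 'r')
plt.show()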

Either way, the red line is drawn with the estimated values. As you can see, a single line cannot properly express data that contains three groups of points.

Hierarchical Bayesian model

To make the hierarchical Bayesian model, I added group ID information to the data.

# Assign a group ID (1, 2, or 3) to every point.
x_1_sign = [1 for i in x_1]
x_2_sign = [2 for i in x_2]
x_3_sign = [3 for i in x_3]
GID = x_1_sign + x_2_sign + x_3_sign
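As a check, GID should label each observation with its group, in the same order as X and Y:

# Indices 0-98 are group 1, 99-147 are group 2, 148-149 are group 3.
print(GID[0], GID[99], GID[148])  # 1 2 3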

In the simple model case, we set the model as follows:

$$Y[i] \sim \mathrm{Normal}(a + b \cdot X[i],\ \sigma)$$

Here, the intercept $a$ and the slope $b$ can be separated into a common part and group differences:

$$a[g] = a_0 + a_g[g], \qquad b[g] = b_0 + b_g[g]$$

$$a_g[g] \sim \mathrm{Normal}(0, \sigma_a), \qquad b_g[g] \sim \mathrm{Normal}(0, \sigma_b)$$

So, the model becomes the following:

$$Y[i] \sim \mathrm{Normal}(a[GID[i]] + b[GID[i]] \cdot X[i],\ \sigma_Y)$$

The Stan model can be written as follows. I saved it to the file “hierarchical.stan”.

data {
    int N;
    int G;                          // number of groups
    int X[N];
    real Y[N];
    int<lower=1, upper=G> GID[N];   // group ID of each observation
}

parameters {
    real a0;                        // common intercept
    real b0;                        // common slope
    real ag[G];                     // group-level intercept offsets
    real bg[G];                     // group-level slope offsets
    real<lower=0> sigma_a;
    real<lower=0> sigma_b;
    real<lower=0> sigma_Y;
}

transformed parameters {
    real a[G];
    real b[G];

    for (g in 1:G){
        a[g] = a0 + ag[g];
        b[g] = b0 + bg[g];
    }
}

model {
    for (g in 1:G) {
        ag[g] ~ normal(0, sigma_a);
        bg[g] ~ normal(0, sigma_b);
    }
    for (i in 1:N){
        Y[i] ~ normal(a[GID[i]] + b[GID[i]] * X[i], sigma_Y);
    }
}

In Python, let’s execute the sampling.

data_hier = {'X': X, 'Y': Y, 'N':len(X), 'GID':GID, 'G':3}
fit = pystan.stan(file='hierarchical.stan', data=data_hier, iter=10000)

The summary is shown below. The Rhat values of many parameters are large, and the credible interval of b0 spans from negative to positive, because group 3 has only a small number of points.

print(fit)
Inference for Stan model: anon_model_3f0dce44b23e650381c177b45c857744.
4 chains, each with iter=10000; warmup=5000; thin=1; 
post-warmup draws per chain=5000, total post-warmup draws=20000.

          mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
a0        7.66   12.65  43.81 -95.24 -15.36  10.75  32.41  82.86     12   1.48
b0       15.15   30.32  60.65 -83.62 -14.79    2.7   32.4 169.96      4   2.44
ag[0]    -2.87   12.65  43.81 -78.11 -27.71  -5.94  20.14  99.97     12   1.48
ag[1]     -3.3   12.65  43.82 -78.45 -27.98  -6.35  19.59  99.35     12   1.48
ag[2]    33.89    13.0  45.02 -42.27   6.67  28.17  57.15 138.83     12   1.45
bg[0]   -12.16   30.32  60.65 -166.9  -29.4   0.29  17.79  86.61      4   2.44
bg[1]   -11.15   30.32  60.65 -165.9 -28.37    1.3  18.79  87.64      4   2.44
bg[2]   -40.12   30.44  60.88 -193.5 -58.65 -26.92   -8.5  59.15      4   2.42
sigma_a  66.09    7.69  65.23   9.08  27.52  48.05   81.6 241.24     72   1.06
sigma_b  102.4   62.71 140.22    9.7  24.93  52.27 116.92 539.87      5   1.34
sigma_Y   3.97    0.03   0.21   3.55   3.83   3.97   4.11   4.41     58   1.06
a[0]      4.78  5.6e-3    0.8   3.25   4.24   4.78   5.32   6.35  20000    1.0
a[1]      4.35  8.2e-3   1.16   2.06   3.59   4.35   5.14   6.64  20000    1.0
a[2]     41.55    0.79   9.88  20.44  35.55  41.88  48.05   60.3    158   1.02
b[0]      2.99  9.8e-5   0.01   2.96   2.98   2.99    3.0   3.02  20000    1.0
b[1]       4.0  2.8e-4   0.04   3.92   3.97    4.0   4.03   4.08  20000    1.0
b[2]    -24.98    0.49   6.22 -36.79 -29.05 -25.14 -21.12 -11.85    162   1.02
lp__    -299.1    1.32   2.95 -305.2 -301.2 -298.9 -296.9 -294.0      5   1.35

Samples were drawn using NUTS at Wed Oct 11 00:05:21 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
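Even so, the group-level estimates for groups 1 and 2 are close to the true values used to generate the data (a[0] ≈ 4.78, b[0] ≈ 2.99 against the true 4 and 3; a[1] ≈ 4.35, b[1] ≈ 4.0 against the true 5 and 4). As a sketch, each group’s estimated line can be drawn from the posterior means:

# Per-group posterior means of the intercepts and slopes.
samples = fit.extract(permuted=True)
a_hat = samples['a'].mean(axis=0)  # shape (G,)
b_hat = samples['b'].mean(axis=0)

plt.scatter(X, Y)
for g, xs in enumerate([x_1, x_2, x_3]):
    xs = np.array(xs)
    plt.plot(xs, a_hat[g] + b_hat[g] * xs, 'r')
plt.show()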

The sampled parameters have the following distributions.

fit.plot()
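As a possible next step, not tried in this post: large Rhat values like these are a typical symptom of the funnel-shaped posterior that centered hierarchical parameterizations produce, and a non-centered parameterization often helps. A minimal sketch of the reparameterized blocks, keeping the same data block as above:

parameters {
    real a0;
    real b0;
    real ag_raw[G];              // group offsets on a unit scale
    real bg_raw[G];
    real<lower=0> sigma_a;
    real<lower=0> sigma_b;
    real<lower=0> sigma_Y;
}

transformed parameters {
    real a[G];
    real b[G];

    for (g in 1:G){
        a[g] = a0 + sigma_a * ag_raw[g];
        b[g] = b0 + sigma_b * bg_raw[g];
    }
}

model {
    for (g in 1:G) {
        ag_raw[g] ~ normal(0, 1);
        bg_raw[g] ~ normal(0, 1);
    }
    for (i in 1:N){
        Y[i] ~ normal(a[GID[i]] + b[GID[i]] * X[i], sigma_Y);
    }
}

Because ag_raw and bg_raw are sampled on a unit scale and only scaled by sigma_a and sigma_b afterwards, the sampler no longer has to explore the funnel geometry directly.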
