Sunday, October 8, 2017

Bayesian multiple regression by Stan

Overview

On the article, Simple Bayesian modeling by Stan, I made a simple linear regression by Stan and PyStan. So, as an extension of it, I made multiple regression model on the same manner to show how to do Bayesian modeling roughly.



Data


On this article, I used air quality data set, which R has as a default. By executing the following line on R console, you can write out air quality data set.

write.csv(airquality, "airquality.csv", row.names=FALSE)

On Python, let’s read and check the data.

import pandas as pd

airquality = pd.read_csv('airquality.csv')
print(airquality.head())
 Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5

As you can see, it has some missing values. We can check the existence of missing values by the following line.

print(airquality.isnull().any())
Ozone       True
Solar.R     True
Wind       False
Temp       False
Month      False
Day        False
dtype: bool

To deal with the missing values, check the ratio of those.

print(airquality.isnull().sum()/airquality.shape[0])
Ozone      0.241830
Solar.R    0.045752
Wind       0.000000
Temp       0.000000
Month      0.000000
Day        0.000000
dtype: float64

On this data, 24% of “Ozone” are missing. “Solar.R” also has 4% missing values. The purpose of this article is to make multiple regression model roughly. So this time, simply by list-wise case deletion, I dealt with those.

# list-wise case deletion
airquality_without_na = airquality.dropna()

print(airquality_without_na.isnull().any())
Ozone      False
Solar.R    False
Wind       False
Temp       False
Month      False
Day        False
dtype: bool

Before making model, to check the relations between variables and the distributions, we should visualize the data.

import matplotlib.pyplot as plt
import seaborn as sns
airquality_data = airquality_without_na[["Ozone", "Solar.R", "Wind", "Temp"]]
sns.pairplot(airquality_data)
plt.show()

enter image description here

Stan modeling

On the same manner as the article below, I wrote the model and saved it to the file, airquality.stan.

Simple Bayesian modeling by Stan

About Bayesian modeling, we can use some languages and tools. BUGS, PyMC, Stan. On this article, I made simple regression model by using Stan from Python. To try simple regression, I used the data set, Speed and Stopping Distances of Cars. R has the data set as default.


The model is as following.

data {
    int N;
    real Ozone[N];
    real Solar[N];
    real Wind[N];
    real Temp[N];
}

parameters {
    real b1;
    real b2;
    real b3;
    real b4;
    real<lower=0> sigma;
}

model {
    for (i in 1:N)
        Ozone[i] ~ normal(b1 + b2 * Solar[i] + b3 * Wind[i] + b4 * Temp[i], sigma);
}

Execute


On the Python side, the following code passes the data to the model.

import pystan
data = {'N':airquality_data.shape[0], 'Ozone': airquality_data['Ozone'], 'Solar':airquality_data['Solar.R'], 'Wind':airquality_data['Wind'], 'Temp':airquality_data['Temp']}
fit = pystan.stan(file='airquality.stan', data=data, seed=42)

Let’s check the summary and plot.

print(fit)
Inference for Stan model: anon_model_ddb6e6d0ac3710f4cc729820ed2ff084.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
b1    -65.03    0.58  23.19 -110.8 -80.81 -64.87 -49.36 -19.15   1618    1.0
b2      0.06  4.5e-4   0.02   0.01   0.04   0.06   0.08   0.11   2801    1.0
b3     -3.31    0.01   0.66  -4.62  -3.77  -3.31  -2.86   -2.0   2106    1.0
b4      1.66  6.1e-3   0.25   1.16   1.49   1.66   1.83   2.15   1740    1.0
sigma  21.48    0.03   1.47  18.86  20.42  21.39  22.43  24.51   2675    1.0
lp__  -391.8    0.04   1.57 -395.7 -392.7 -391.6 -390.7 -389.7   1689    1.0

Samples were drawn using NUTS at Sat Oct  7 22:39:53 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
fit.plot()
enter image description here

Reference