Overview
On the article, Simple Bayesian modeling by Stan, I made a simple linear regression by Stan and PyStan. So, as an extension of it, I made multiple regression model on the same manner to show how to do Bayesian modeling roughly.Data
On this article, I used air quality data set, which R has as a default. By executing the following line on R console, you can write out air quality data set.
write.csv(airquality, "airquality.csv", row.names=FALSE)
On Python, let’s read and check the data.
import pandas as pd
airquality = pd.read_csv('airquality.csv')
print(airquality.head())
Ozone Solar.R Wind Temp Month Day
0 41.0 190.0 7.4 67 5 1
1 36.0 118.0 8.0 72 5 2
2 12.0 149.0 12.6 74 5 3
3 18.0 313.0 11.5 62 5 4
4 NaN NaN 14.3 56 5 5
As you can see, it has some missing values. We can check the existence of missing values by the following line.
print(airquality.isnull().any())
Ozone True
Solar.R True
Wind False
Temp False
Month False
Day False
dtype: bool
To deal with the missing values, check the ratio of those.
print(airquality.isnull().sum()/airquality.shape[0])
Ozone 0.241830
Solar.R 0.045752
Wind 0.000000
Temp 0.000000
Month 0.000000
Day 0.000000
dtype: float64
On this data, 24% of “Ozone” are missing. “Solar.R” also has 4% missing values. The purpose of this article is to make multiple regression model roughly. So this time, simply by list-wise case deletion, I dealt with those.
# list-wise case deletion
airquality_without_na = airquality.dropna()
print(airquality_without_na.isnull().any())
Ozone False
Solar.R False
Wind False
Temp False
Month False
Day False
dtype: bool
Before making model, to check the relations between variables and the distributions, we should visualize the data.
import matplotlib.pyplot as plt
import seaborn as sns
airquality_data = airquality_without_na[["Ozone", "Solar.R", "Wind", "Temp"]]
sns.pairplot(airquality_data)
plt.show()
Stan modeling
On the same manner as the article below, I wrote the model and saved it to the file, airquality.stan.Simple Bayesian modeling by Stan
About Bayesian modeling, we can use some languages and tools. BUGS, PyMC, Stan. On this article, I made simple regression model by using Stan from Python. To try simple regression, I used the data set, Speed and Stopping Distances of Cars. R has the data set as default.
The model is as following.
data {
int N;
real Ozone[N];
real Solar[N];
real Wind[N];
real Temp[N];
}
parameters {
real b1;
real b2;
real b3;
real b4;
real<lower=0> sigma;
}
model {
for (i in 1:N)
Ozone[i] ~ normal(b1 + b2 * Solar[i] + b3 * Wind[i] + b4 * Temp[i], sigma);
}
Execute
On the Python side, the following code passes the data to the model.
import pystan
data = {'N':airquality_data.shape[0], 'Ozone': airquality_data['Ozone'], 'Solar':airquality_data['Solar.R'], 'Wind':airquality_data['Wind'], 'Temp':airquality_data['Temp']}
fit = pystan.stan(file='airquality.stan', data=data, seed=42)
Let’s check the summary and plot.
print(fit)
Inference for Stan model: anon_model_ddb6e6d0ac3710f4cc729820ed2ff084.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
b1 -65.03 0.58 23.19 -110.8 -80.81 -64.87 -49.36 -19.15 1618 1.0
b2 0.06 4.5e-4 0.02 0.01 0.04 0.06 0.08 0.11 2801 1.0
b3 -3.31 0.01 0.66 -4.62 -3.77 -3.31 -2.86 -2.0 2106 1.0
b4 1.66 6.1e-3 0.25 1.16 1.49 1.66 1.83 2.15 1740 1.0
sigma 21.48 0.03 1.47 18.86 20.42 21.39 22.43 24.51 2675 1.0
lp__ -391.8 0.04 1.57 -395.7 -392.7 -391.6 -390.7 -389.7 1689 1.0
Samples were drawn using NUTS at Sat Oct 7 22:39:53 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
fit.plot()