Overview
About Bayesian modeling, we can use some languages and tools. BUGS, PyMC, Stan. On this article, I made simple regression model by using Stan from Python.Data
To try simple regression, I used the data set, Speed and Stopping Distances of Cars. R has the data set as default. So to get the data, type the following line on R console.write.csv(cars, "cars.csv", row.names=FALSE)
On Python, we can read the data.
import pandas as pd
cars = pd.read_csv('cars.csv')
print(cars)
speed dist
0 4 2
1 4 10
2 7 4
3 7 22
4 8 16
5 9 10
6 10 18
7 10 26
8 10 34
9 11 17
10 11 28
11 12 14
12 12 20
13 12 24
14 12 28
15 13 26
16 13 34
17 13 34
18 13 46
19 14 26
20 14 36
21 14 60
22 14 80
23 15 20
24 15 26
25 15 54
26 16 32
27 16 40
28 17 32
29 17 40
30 17 50
31 18 42
32 18 56
33 18 76
34 18 84
35 19 36
36 19 46
37 19 68
38 20 32
39 20 48
40 20 52
41 20 56
42 20 64
43 22 66
44 23 54
45 24 70
46 24 92
47 24 93
48 24 120
49 25 85
As you can see, this data has just two columns, speed and dist. I used “speed” as explaining variable and “dist” as explained variable.
Stan modeling
When we use Stan, we need to write the model and save it to stan file. I wrote the following model and saved it to “cars.stan”.
data{
int N;
real X[N];
real Y[N];
}
parameters{
real a;
real b;
real<lower=0> sigma;
}
model {
for (i in 1:N){
Y[i] ~ normal(a * X[i] + b, sigma);
}
}
This has three parts, data, parameters and model. Simply, the roles of each part are as followings.
- data: The data which is used on the model part
- parameters: The target to estimate
- model: The model
Intuitively, the data part is similar to placeholder on TensorFlow from the viewpoint of the code.
On this case, the model means that Y follows normal distribution whose mean is the linear combination, and standard deviation is sigma.
Execute
To calculate and get sampled points, we need to pass the data to the model on Python.(Actually, you can also use R.)
You can install pystan by pip.
pip install pystan
It is easy to pass the data to the model.
import pystan
data = {'X':cars['speed'], 'Y':cars['dist'], 'N':cars.shape[0]}
fit = pystan.stan(file='cars.stan', data=data, seed=42)
The code above executes the calculation.
The variable, fit, has summary of the sampled points.
print(fit)
Inference for Stan model: anon_model_a754b000381b0bd871cba7120173c0e4.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
a 3.94 0.01 0.43 3.1 3.65 3.93 4.22 4.81 1437 1.0
b -17.61 0.19 7.06 -31.57 -22.19 -17.54 -13.01 -4.0 1443 1.0
sigma 15.83 0.04 1.68 12.98 14.66 15.68 16.83 19.55 2174 1.0
lp__ -159.4 0.04 1.3 -162.7 -160.0 -159.1 -158.5 -158.0 1178 1.0
Samples were drawn using NUTS at Sat Oct 7 08:22:10 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
When you want to get the plot of the sampled points. Simply, you can use the function, plot(). fit.plot()
By using the estimated values, let’s draw the linear regression line.
import matplotlib.pyplot as plt
import numpy as np
samples = fit.extract(permuted=True)
x = np.linspace(0, max(cars['speed']))
y = samples['a'].mean() * x + samples['b'].mean()
plt.figure()
plt.plot(cars['speed'], cars['dist'], "o")
plt.plot(x, y, "r-")
plt.show()
Related Posts
Bayesian multiple regression by Stan
On the article, Simple Bayesian modeling by Stan, I made a simple linear regression by Stan and PyStan. So, as an extension of it, I made multiple regression model on the same manner to show how to do Bayesian modeling roughly. On this article, I used air quality data set, which R has as a default.