Abstract
In this article, I'll examine the EM algorithm when its initial values are provided by k-means. For many algorithms, the choice of initial values is an important theme. With inappropriate initial values, the EM algorithm can take a long time to converge, and if it is written naively it can stop with an error because a covariance matrix becomes non-positive-definite. So I'll run an experiment to check how the accuracy, the number of iterations and the execution time change with and without k-means initialization. Here, I'll only deal with the EM algorithm for the Gaussian mixture case.
If there is a mistake or an inappropriate point, please let me know in a comment.
Code of EM and k-means
I'll use the EM algorithm and k-means code from my GitHub.
That code was written for the articles below. If you want to know the details, please check them.
- Probabilistic Clustering with EM algorithm: Algorithm and Visualization with Julia from scratch
- Introduction to K-means: Algorithm and Visualization with Julia from scratch
Overview of experiment
In the EM algorithm, the posterior probabilities and the parameters are updated by repeating the E-step and the M-step. The GIF image below, from the article Probabilistic Clustering with EM algorithm: Algorithm and Visualization with Julia from scratch, shows how the clusters are built. We can observe the cluster centers moving over the iterations. The covariances are another important set of parameters, but they are not visible in this image.
Anyway, what I want to do is give the EM algorithm better initial values. From k-means, I'll obtain center points, covariances and mixture rates, and by comparing with the usual EM algorithm I'll try to see the characteristics of each.
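To make the loop concrete, here is a minimal sketch of one E-step and one M-step for a Gaussian mixture, written with the Distributions package. This is not the EM() implementation from the repository above; the function name emStep and the variable names are only for illustration.
using Distributions
# Minimal sketch of one EM update for a Gaussian mixture (illustrative only).
function emStep(X, means, covars, mixRate)
    n, k = size(X, 1), length(means)
    # E-step: responsibility of each cluster for each data point
    resp = zeros(n, k)
    for j in 1:k
        dist = MvNormal(means[j], covars[j])
        for i in 1:n
            resp[i, j] = mixRate[j] * pdf(dist, vec(X[i, :]))
        end
    end
    resp ./= sum(resp, 2)
    # M-step: re-estimate means, covariances and mixture rates
    for j in 1:k
        nj = sum(resp[:, j])
        means[j] = vec(sum(resp[:, j] .* X, 1)) / nj
        centered = X .- means[j]'
        covars[j] = (centered' * (resp[:, j] .* centered)) / nj
        mixRate[j] = nj / n
    end
    return means, covars, mixRate
end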
The target data is the Iris data set, and the code I'll use is on my GitHub.
I'll measure the following three points.
- the number of iteration
- accuracy
- time
Experiment
After cloning from my GitHub, we read the source code.
include("./Clustering/src/kmeans.jl")
include("./EM/src/EM.jl")
Also, we need to load the Iris data set from RDatasets.
using RDatasets
iris = dataset("datasets", "iris")
The Iris data set is a DataFrame by default, so I need to convert it to a matrix that the EM algorithm program can read.
function dataFrame2Matrix(data::DataFrame)
    r, c = size(data)
    matrix = zeros(r, c)
    for i in 1:r
        matrix[i, :] = vec(Array(data[i, :]))
    end
    return matrix
end
irisMatrix = dataFrame2Matrix(iris[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]])
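As a quick sanity check: the standard Iris data set has 150 rows, and we selected four numeric columns.
size(irisMatrix)  # (150, 4)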
In addition, we need a function that converts a DataFrame to a jagged array, because the k-means implementation takes a jagged array as input.
function dataFrame2JaggedArray(data::DataFrame)
    returnArray = Array{Array}(nrow(data))
    for i in 1:nrow(data)
        returnArray[i] = vec(Array(data[i, :]))
    end
    return returnArray
end
The initial values I will give to the EM algorithm are the center points, the covariances and the mixture rates. For the center points, we can use the centroids from k-means. The mixture rate of a cluster is the ratio of the number of data points it contains to the total. The covariances are a bit more involved; each cluster's covariance is expressed by the following expression.
$$\Sigma_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^{\top}$$
- $k$: the cluster
- $N_k$: the number of data points the cluster has
- $C_k$: the set of indices of the data points in cluster $k$
- $\mu_k$: the centroid of cluster $k$
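As a concrete illustration of this expression, here is a toy computation with made-up 2-D points; it is not part of the experiment, only a translation of the formula into code.
# Toy example only: covariance of a single cluster, following the expression above
clusterMembers = [[4.9, 3.0], [5.1, 3.4], [5.0, 3.2]]
centroid = sum(clusterMembers) / length(clusterMembers)
summed = zeros(2, 2)
for x in clusterMembers
    summed += (x - centroid) * (x - centroid)'
end
covariance = summed / length(clusterMembers)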
The following function runs k-means and returns the center points, covariances and mixture rates to be used as initial values for the EM algorithm.
function getParameters(data, k)
    resultKmeans = kMeans(data, k; initializer="kmeans++")
    centroids = resultKmeans.centroids[end]
    estimatedClass = resultKmeans.estimatedClass
    covariances = []
    mixRate = []
    centroidLength = length(centroids[1])
    for (i, centroid) in enumerate(centroids)
        # data points assigned to cluster i
        index = find(i .== estimatedClass)
        clusterData = dataFrame2JaggedArray(resultKmeans.x[index, :])
        # covariance of cluster i around its k-means centroid
        summed = zeros(centroidLength, centroidLength)
        for clusterMember in clusterData
            summed += (clusterMember - centroid) * (clusterMember - centroid)'
        end
        covar = summed / length(clusterData)
        push!(covariances, covar)
        push!(mixRate, length(index))
    end
    # normalize the cluster sizes into mixture rates
    mixRate /= sum(mixRate)
    return centroids, covariances, mixRate
end
Finally, I'll tackle the experiment. As I wrote before, I'll store and check the following three items.
- the number of iteration
- accuracy
- time
This time, I'll run one thousand repetitions and store this information for the EM algorithm with and without k-means initialization. On each repetition, k-means is run again for the initialization. For measuring time, we can use the @elapsed macro.
The EM algorithm I'm using here can receive a composite type, Initialization, as its initial values. The values obtained by getParameters() are stored in an Initialization instance, which is passed to EM() as a keyword argument.
srand(12345)
iterCount = []
iterCountWithInit = []
emClass = []
emClassWithInit = []
emTimes = []
emTimesWithInit = []
errorNum = 0
for _ in 1:1000
    initTime = @elapsed centroids, covariances, mixRate = getParameters(iris[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]], 3)
    try
        emTime = @elapsed em = EM(irisMatrix, 3)
        initializer = Initialization(centroids, covariances, mixRate)
        emInitTime = @elapsed emInit = EM(irisMatrix, 3; initialization=initializer)
        push!(iterCount, em.iterCount)
        push!(iterCountWithInit, emInit.iterCount)
        push!(emClass, em.posterior[end])
        push!(emClassWithInit, emInit.posterior[end])
        push!(emTimes, emTime)
        push!(emTimesWithInit, initTime + emInitTime)
    catch
        # count runs that failed, e.g. because a covariance became non-positive-definite
        errorNum += 1
    end
end
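Before plotting, it can be worth checking how many of the one thousand runs were actually recorded; the exact counts depend on the random seed and the implementations, so this is only a sanity check.
println(errorNum)            # runs that raised an error and were skipped
println(length(iterCount))   # runs that were recorded successfully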
By drawing histograms, we can check the differences and the effect of k-means initialization.
First, the number of iterations. The result matches intuition: with the usual EM algorithm, the distribution has a large variance, while with k-means initialization the iteration counts gather tightly around a low value.
using PyPlot
plt[:hist](iterCount, color="red", alpha=0.5)
plt[:hist](iterCountWithInit, color="blue", alpha=0.5)
legend(["EM algorithm", "EM algorithm with K-means"])
title("Iteration")
Next, accuracy. For most people this is probably the most interesting point. The EM algorithm outputs, for each data point, the probability of belonging to each cluster. To check accuracy, I'll take the argmax, meaning each data point is assigned to the cluster with the highest probability. Also, because the cluster labels are arbitrary, I'll write a function that tries every label permutation (round-robin) and keeps the best match.
using MLBase
using Combinatorics
function calculateAccuracy(groundTruth, predicted)
    maxCombination = []
    maxAccuracy = 0.0
    # try every assignment of cluster labels to ground-truth labels and keep the best
    for combination in permutations(collect(Set(predicted)))
        conv = Dict()
        sortedGroundTruth = sort(collect(Set(groundTruth)))
        for pair in zip(sortedGroundTruth, combination)
            conv[pair[1]] = pair[2]
        end
        target = Array{Int}(length(groundTruth))
        for (i, label) in enumerate(groundTruth)
            target[i] = conv[label]
        end
        accuracy = correctrate(target, predicted)
        if accuracy > maxAccuracy
            maxAccuracy = accuracy
            maxCombination = combination
        end
    end
    return maxAccuracy
end
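As a toy illustration of the permutation matching, with made-up labels rather than the Iris data: when the predicted cluster numbers are just a relabeling of the ground truth, the accuracy is 1.0.
gt = ["setosa", "setosa", "versicolor", "virginica"]
pred = [2, 2, 3, 1]
calculateAccuracy(gt, pred)  # 1.0: setosa=>2, versicolor=>3, virginica=>1 matches every point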
function iterArgMax(data)
    predictedLabel = []
    for oneTime in data
        predicted = [indmax(prob) for prob in oneTime]
        push!(predictedLabel, predicted)
    end
    return predictedLabel
end
By applying the argmax and the accuracy calculation, we can check the accuracy of the EM algorithm with and without k-means initialization.
speciesTemp = convert(Array, iris[:, [:Species]])
emDeterministic = iterArgMax(emClass)
emInitDeterministic = iterArgMax(emClassWithInit)
accEM = []
accEMWithInit = []
for i in 1:length(emDeterministic)
    push!(accEM, calculateAccuracy(speciesTemp, emDeterministic[i]))
    push!(accEMWithInit, calculateAccuracy(speciesTemp, emInitDeterministic[i]))
end
For the Iris data, the outcome looks like the following.
Because the k-means initialization steers the algorithm toward points near particular local optima, the accuracies gather around a few values.
plt[:hist](accEM, color="red", alpha=0.5)
plt[:hist](accEMWithInit, color="blue", alpha=0.5)
legend(["EM algorithm", "EM algorithm with K-means"])
title("Accuracy")
Let's check how much time these took. For the EM algorithm with k-means initialization, the total time of the k-means initialization and the EM algorithm is counted as one run's time.
For the Iris data, with k-means initialization, the required time is shorter and more stable.
plt[:hist](emTimes, color="red", alpha=0.5)
plt[:hist](emTimesWithInit, color="blue", alpha=0.5)
legend(["EM algorithm", "EM algorithm with K-means"])
title("Time")