Abstract
In this article, I'll examine the EM algorithm when its initial values are provided by k-means. For many algorithms, the choice of initial values is an important theme. With inappropriate initial values, the EM algorithm can take a long time to converge, and if it is written naively it can stop with an error because a covariance matrix becomes non-positive-definite. So I'll run an experiment to check how the accuracy, the number of iterations and the execution time change with and without k-means initialization. Here, I'll only deal with the EM algorithm for the Gaussian mixture case.
If there is a mistake or an inappropriate point, please let me know in a comment.
Code of EM and k-means
I'll use the EM algorithm and k-means code from my GitHub.
That code was written for the articles below. If you want to know the details, please check them.
- Probabilistic Clustering with EM algorithm: Algorithm and Visualization with Julia from scratch
- Introduction to K-means: Algorithm and Visualization with Julia from scratch
Overview of experiment
In the EM algorithm, the posterior probabilities and the parameters are updated by repeating the E-step and the M-step. The GIF image below, from the article Probabilistic Clustering with EM algorithm: Algorithm and Visualization with Julia from scratch, shows how the clusters are built. We can observe the cluster centers moving over the iterations. The covariances are another important set of parameters, but they are not visible in this image.
Anyway, what I want to do is give the EM algorithm better initial values. From k-means, I'll obtain center points, covariances and mixture rates, and by comparing with the usual EM algorithm I'll try to see the characteristics of each.
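To make the loop concrete, here is a minimal sketch of one E-step and one M-step for a Gaussian mixture, written with the Distributions package. This is not the EM() implementation from the repository above; the function name emStep and the variable names are only for illustration.
using Distributions
# Minimal sketch of one EM update for a Gaussian mixture (illustrative only).
function emStep(X, means, covars, mixRate)
    n, k = size(X, 1), length(means)
    # E-step: responsibility of each cluster for each data point
    resp = zeros(n, k)
    for j in 1:k
        dist = MvNormal(means[j], covars[j])
        for i in 1:n
            resp[i, j] = mixRate[j] * pdf(dist, vec(X[i, :]))
        end
    end
    resp ./= sum(resp, 2)
    # M-step: re-estimate means, covariances and mixture rates
    for j in 1:k
        nj = sum(resp[:, j])
        means[j] = vec(sum(resp[:, j] .* X, 1)) / nj
        centered = X .- means[j]'
        covars[j] = (centered' * (resp[:, j] .* centered)) / nj
        mixRate[j] = nj / n
    end
    return means, covars, mixRate
end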
The target data is the Iris data set, and the code I'll use is on my GitHub.
I'll measure the following three points.
- the number of iteration
- accuracy
- time
Experiment
After cloning from my GitHub, we read the source code.
include("./Clustering/src/kmeans.jl")
include("./EM/src/EM.jl")
Also, we need to load the Iris data set from RDatasets.
using RDatasets
iris = dataset("datasets", "iris")
The Iris data set is a DataFrame by default, so I need to convert it to a matrix that the EM algorithm program can read.
function dataFrame2Matrix(data::DataFrame)
    r, c = size(data)
    matrix = zeros(r, c)
    for i in 1:r
        matrix[i, :] = vec(Array(data[i, :]))
    end
    return matrix
end
irisMatrix = dataFrame2Matrix(iris[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]])
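As a quick sanity check: the standard Iris data set has 150 rows, and we selected four numeric columns.
size(irisMatrix)  # (150, 4)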
In addition, we need a function that converts a DataFrame to a jagged array, because the k-means implementation takes a jagged array as input.
function dataFrame2JaggedArray(data::DataFrame)
    returnArray = Array{Array}(nrow(data))
    for i in 1:nrow(data)
        returnArray[i] = vec(Array(data[i, :]))
    end
    return returnArray
end
The initial values I will give to the EM algorithm are the center points, the covariances and the mixture rates. For the center points, we can use the centroids from k-means. The mixture rate of a cluster is the ratio of the number of data points it contains to the total. The covariances are a bit more involved; each cluster's covariance is expressed by the following expression.
$$\Sigma_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^{\top}$$
- $k$: the cluster
- $N_k$: the number of data points the cluster has
- $C_k$: the set of indices of the data points in cluster $k$
- $\mu_k$: the centroid of cluster $k$
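As a concrete illustration of this expression, here is a toy computation with made-up 2-D points; it is not part of the experiment, only a translation of the formula into code.
# Toy example only: covariance of a single cluster, following the expression above
clusterMembers = [[4.9, 3.0], [5.1, 3.4], [5.0, 3.2]]
centroid = sum(clusterMembers) / length(clusterMembers)
summed = zeros(2, 2)
for x in clusterMembers
    summed += (x - centroid) * (x - centroid)'
end
covariance = summed / length(clusterMembers)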
The following function runs k-means and returns the center points, covariances and mixture rates to be used as initial values for the EM algorithm.
function getParameters(data, k)
    resultKmeans = kMeans(data, k; initializer="kmeans++")
    centroids = resultKmeans.centroids[end]
    estimatedClass = resultKmeans.estimatedClass
    covariances = []
    mixRate = []
    centroidLength = length(centroids[1])
    for (i, centroid) in enumerate(centroids)
        # data points assigned to cluster i
        index = find(i .== estimatedClass)
        clusterData = dataFrame2JaggedArray(resultKmeans.x[index, :])
        # covariance of cluster i around its k-means centroid
        summed = zeros(centroidLength, centroidLength)
        for clusterMember in clusterData
            summed += (clusterMember - centroid) * (clusterMember - centroid)'
        end
        covar = summed / length(clusterData)
        push!(covariances, covar)
        push!(mixRate, length(index))
    end
    # normalize the cluster sizes into mixture rates
    mixRate /= sum(mixRate)
    return centroids, covariances, mixRate
end
Finally, I'll tackle the experiment. As I wrote before, I'll store and check the following three items.
- the number of iteration
- accuracy
- time
This time, I'll run one thousand repetitions and store this information for the EM algorithm with and without k-means initialization. On each repetition, k-means is run again for the initialization. For measuring time, we can use the @elapsed macro.
The EM algorithm I'm using here can receive a composite type, Initialization, as its initial values. The values obtained by getParameters() are stored in an Initialization instance, which is passed to EM() as a keyword argument.
srand(12345)
iterCount = []
iterCountWithInit = []
emClass = []
emClassWithInit = []
emTimes = []
emTimesWithInit = []
errorNum = 0
for _ in 1:1000
    initTime = @elapsed centroids, covariances, mixRate = getParameters(iris[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]], 3)
    try
        emTime = @elapsed em = EM(irisMatrix, 3)
        initializer = Initialization(centroids, covariances, mixRate)
        emInitTime = @elapsed emInit = EM(irisMatrix, 3; initialization=initializer)
        push!(iterCount, em.iterCount)
        push!(iterCountWithInit, emInit.iterCount)
        push!(emClass, em.posterior[end])
        push!(emClassWithInit, emInit.posterior[end])
        push!(emTimes, emTime)
        push!(emTimesWithInit, initTime + emInitTime)
    catch
        # count runs that failed, e.g. because a covariance became non-positive-definite
        errorNum += 1
    end
end
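Before plotting, it can be worth checking how many of the one thousand runs were actually recorded; the exact counts depend on the random seed and the implementations, so this is only a sanity check.
println(errorNum)            # runs that raised an error and were skipped
println(length(iterCount))   # runs that were recorded successfully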
By drawing histograms, we can check the differences and the effect of k-means initialization.
First, the number of iterations. The result matches intuition: with the usual EM algorithm, the distribution has a large variance, while with k-means initialization the iteration counts gather tightly around a low value.
using PyPlot
plt[:hist](iterCount, color="red", alpha=0.5)
plt[:hist](iterCountWithInit, color="blue", alpha=0.5)
legend(["EM algorithm", "EM algorithm with K-means"])
title("Iteration")
Next, accuracy. For most people this is probably the most interesting point. The EM algorithm outputs, for each data point, the probability of belonging to each cluster. To check accuracy, I'll take the argmax, meaning each data point is assigned to the cluster with the highest probability. Also, because the cluster labels are arbitrary, I'll write a function that tries every label permutation (round-robin) and keeps the best match.
using MLBase
using Combinatorics
function calculateAccuracy(groundTruth, predicted)
    maxCombination = []
    maxAccuracy = 0.0
    # try every assignment of cluster labels to ground-truth labels and keep the best
    for combination in permutations(collect(Set(predicted)))
        conv = Dict()
        sortedGroundTruth = sort(collect(Set(groundTruth)))
        for pair in zip(sortedGroundTruth, combination)
            conv[pair[1]] = pair[2]
        end
        target = Array{Int}(length(groundTruth))
        for (i, label) in enumerate(groundTruth)
            target[i] = conv[label]
        end
        accuracy = correctrate(target, predicted)
        if accuracy > maxAccuracy
            maxAccuracy = accuracy
            maxCombination = combination
        end
    end
    return maxAccuracy
end
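As a toy illustration of the permutation matching, with made-up labels rather than the Iris data: when the predicted cluster numbers are just a relabeling of the ground truth, the accuracy is 1.0.
gt = ["setosa", "setosa", "versicolor", "virginica"]
pred = [2, 2, 3, 1]
calculateAccuracy(gt, pred)  # 1.0: setosa=>2, versicolor=>3, virginica=>1 matches every point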
function iterArgMax(data)
    predictedLabel = []
    for oneTime in data
        predicted = [indmax(prob) for prob in oneTime]
        push!(predictedLabel, predicted)
    end
    return predictedLabel
end
By applying the argmax and the accuracy calculation, we can check the accuracy of the EM algorithm with and without k-means initialization.
speciesTemp = convert(Array, iris[:, [:Species]])
emDeterministic = iterArgMax(emClass)
emInitDeterministic = iterArgMax(emClassWithInit)
accEM = []
accEMWithInit = []
for i in 1:length(emDeterministic)
    push!(accEM, calculateAccuracy(speciesTemp, emDeterministic[i]))
    push!(accEMWithInit, calculateAccuracy(speciesTemp, emInitDeterministic[i]))
end
For the Iris data, the outcome looks like the following.
Because the k-means initialization steers the algorithm toward points near particular local optima, the accuracies gather around a few values.
plt[:hist](accEM, color="red", alpha=0.5)
plt[:hist](accEMWithInit, color="blue", alpha=0.5)
legend(["EM algorithm", "EM algorithm with K-means"])
title("Accuracy")
Let's check how much time these took. For the EM algorithm with k-means initialization, the total time of the k-means initialization and the EM algorithm is counted as one run's time.
For the Iris data, with k-means initialization, the required time is shorter and more stable.
plt[:hist](emTimes, color="red", alpha=0.5)
plt[:hist](emTimesWithInit, color="blue", alpha=0.5)
legend(["EM algorithm", "EM algorithm with K-means"])
title("Time")