Saturday, August 12, 2017

kmeans in Go from scratch

Overview


Here, I introduce how to write kmeans, one of the machine learning algorithms, in Go. I'm fairly new to Go, and this is a coding exercise for me: learning the language through machine learning algorithms.
Go has some machine learning packages, but I couldn't find information about how to write machine learning algorithms from scratch. So I'll introduce them step by step.
What I write here is not for practical use but for understanding the algorithms by reading and writing them, so I give priority to keeping the code simple, sacrificing accuracy-improving elements when they are not very easy to add.


What is kmeans?


kmeans is one of the unsupervised machine learning algorithms; it separates data into clusters. Its remarkable characteristic is that you can choose the number of clusters the data are separated into. I think this characteristic makes kmeans one of the most approachable clustering algorithms.

Algorithm


The kmeans algorithm defines k representative points and assigns each data point to the nearest representative point. For each representative point, it then calculates the average of the data points that belong to it. That centroid becomes the new representative point, and the data are again assigned to the nearest points. The cycle is simply repeated until the state stops changing.
In a nutshell, it can be summarized as follows (see the toy example after this list).
After defining the first representative points,
  1. assign each data point to the nearest representative point
  2. make the centroid of the assigned data the new representative point
  3. repeat the two steps above until the state doesn't change
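
As a concrete toy example of one cycle of these two steps, here is a self-contained 1-D sketch with made-up numbers, independent of the program below:

package main

import "fmt"

//one kmeans cycle in 1-D: assign labels, then move the representatives
func main() {
    data := []float64{1, 2, 9, 10}
    reps := []float64{0, 5} //two representative points (k = 2)

    //step 1: assign each point to the nearest representative
    labels := make([]int, len(data))
    for i, x := range data {
        if abs(x-reps[0]) > abs(x-reps[1]) {
            labels[i] = 1
        }
    }

    //step 2: move each representative to the centroid of its points
    sum := make([]float64, 2)
    n := make([]float64, 2)
    for i, x := range data {
        sum[labels[i]] += x
        n[labels[i]]++
    }
    for c := range reps {
        if n[c] > 0 {
            reps[c] = sum[c] / n[c]
        }
    }
    fmt.Println(labels, reps) //[0 0 1 1] [1.5 9.5]
}

func abs(x float64) float64 {
    if x < 0 {
        return -x
    }
    return x
}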

The algorithm I wrote here


I chose the following approach for this kmeans implementation.
  • initialize the representative points with range-bounded random numbers
  • use Euclidean distance as the distance function
  • stop when an update no longer changes any label
Actually, for initializing the representative points, it is more usual to randomly assign the data to the k clusters and take the centroid of each cluster as a starting point, and there are several types of stopping criteria to choose from. This time I chose the approach above somewhat arbitrarily; a sketch of the more usual initialization follows.
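
For reference, here is a minimal sketch of that more common initialization, assuming the same math/rand import as the program below; the function name randomAssignInit is my own, not part of the code in this post:

//randomAssignInit assigns each point a random cluster in [0, k),
//then returns the k cluster centroids as initial representatives
func randomAssignInit(data [][]float64, k int) [][]float64 {
    sums := make([][]float64, k)
    counts := make([]int, k)
    for i := range sums {
        sums[i] = make([]float64, len(data[0]))
    }
    for _, d := range data {
        c := rand.Intn(k) //random cluster assignment
        counts[c]++
        for j, v := range d {
            sums[c][j] += v
        }
    }
    for i := range sums {
        if counts[i] == 0 {
            continue //an empty cluster stays at zeros; real code should re-draw
        }
        for j := range sums[i] {
            sums[i][j] /= float64(counts[i])
        }
    }
    return sums
}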

The code I used


package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "math"
    "math/rand"
    "os"
    "reflect"
    "strconv"
    "time"
)

func main() {
    //read data
    irisMatrix := [][]string{}
    iris, err := os.Open("iris.csv")
    if err != nil {
        panic(err)
    }
    defer iris.Close()

    reader := csv.NewReader(iris)
    reader.Comma = ','
    reader.LazyQuotes = true
    for {
        record, err := reader.Read()
        if err == io.EOF {
            break
        } else if err != nil {
            panic(err)
        }
        irisMatrix = append(irisMatrix, record)
    }

    X := [][]float64{}
    Y := []string{}
    for _, data := range irisMatrix {
        //convert the string slice into a float slice
        temp := []float64{}
        for _, i := range data[:4] {
            parsedValue, err := strconv.ParseFloat(i, 64)
            if err != nil {
                panic(err)
            }
            temp = append(temp, parsedValue)
        }
        //append to the explanatory variables
        X = append(X, temp)
        //append to the explained variable (label)
        Y = append(Y, data[4])
    }
    km := Kmeans{}
    fmt.Printf("predict:%v\n", km.fit(X, 3))
    fmt.Printf("teacher:%v\n", Y)
}

type Kmeans struct {
    data            [][]float64
    labels          []int
    representatives [][]float64
}

//Transpose returns the transpose of a 2D slice
func Transpose(source [][]float64) [][]float64 {
    dest := make([][]float64, len(source[0]))
    for i := range dest {
        dest[i] = make([]float64, len(source))
    }
    for i := 0; i < len(source); i++ {
        for j := 0; j < len(source[0]); j++ {
            dest[j][i] = source[i][j]
        }
    }
    return dest
}

//Dist calculates the Euclidean distance between two points
func Dist(source, dest []float64) float64 {
    var dist float64
    for i := range source {
        dist += math.Pow(source[i]-dest[i], 2)
    }
    return math.Sqrt(dist)
}

//ArgMin returns the index of the smallest value in the slice
func ArgMin(target []float64) int {
    var (
        index int
        base  float64
    )
    for i, d := range target {
        if i == 0 || d < base {
            index = i
            base = d
        }
    }
    return index
}

func (km *Kmeans) fit(X [][]float64, k int) []int {
    //set random number seeds
    rand.Seed(time.Now().UnixNano())
    //store data into structure
    km.data = X

    //initialize representative vectors
    //to define the random-number range for initializing the representative
    //points, find the minimum and maximum of each explanatory variable
    transposedData := Transpose(km.data)
    var minMax [][]float64
    for _, d := range transposedData {
        //start min and max at the first value; starting at zero would
        //give a wrong range when all values have the same sign
        min, max := d[0], d[0]
        for _, n := range d {
            min = math.Min(min, n)
            max = math.Max(max, n)
        }
        minMax = append(minMax, []float64{min, max})
    }
    //set initial values of the representative points
    for i := 0; i < k; i++ {
        km.representatives = append(km.representatives, make([]float64, len(minMax)))
    }
    for i := 0; i < k; i++ {
        for j := 0; j < len(minMax); j++ {
            km.representatives[i][j] = rand.Float64()*(minMax[j][1]-minMax[j][0]) + minMax[j][0]
        }
    }
    //initialize labels
    //calculate the distance between each data point and each
    //representative point and assign the label of the nearest one
    for _, d := range km.data {
        var distance []float64
        for _, r := range km.representatives {
            distance = append(distance, Dist(d, r))
        }
        km.labels = append(km.labels, ArgMin(distance))
    }
    for {
        //update the representative points:
        //the centroid of the data belonging to a representative point
        //becomes the new representative point (index i is the label)
        var tempRepresentatives [][]float64
        for i := range km.representatives {
            var grouped [][]float64
            for j, d := range km.data {
                if km.labels[j] == i {
                    grouped = append(grouped, d)
                }
            }
            if len(grouped) != 0 {
                transposedGroup := Transpose(grouped)
                updated := []float64{}
                for _, vectors := range transposedGroup {
                    value := 0.0
                    for _, v := range vectors {
                        value += v
                    }
                    //store the mean of each explanatory variable
                    updated = append(updated, value/float64(len(vectors)))
                }
                tempRepresentatives = append(tempRepresentatives, updated)
            } else {
                //keep the previous point when a cluster is empty so that
                //labels keep corresponding to representative indices
                tempRepresentatives = append(tempRepresentatives, km.representatives[i])
            }
        }
        km.representatives = tempRepresentatives

        //update labels
        tempLabel := []int{}
        for _, d := range km.data {
            var distance []float64
            for _, r := range km.representatives {
                distance = append(distance, Dist(d, r))
            }
            tempLabel = append(tempLabel, ArgMin(distance))
        }
        if reflect.DeepEqual(km.labels, tempLabel) {
            break
        } else {
            km.labels = tempLabel
        }
    }
    return km.labels
}
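
To run this, save the code as main.go, put an iris.csv next to it, and use go run. The code parses the first four columns as floats and takes the fifth as the class label, so the file must contain no header row; each record looks something like this (layout assumed from the parsing code, matching the common Fisher iris CSV):

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa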

Check one by one


The major part of the code is the fit() method. It is too long to check all at once, so let's split it up and check each part.

func (km *Kmeans) fit(X [][]float64, k int) []int {
    //set random number seeds
    rand.Seed(time.Now().UnixNano())
    //store data into structure
    km.data = X

The fit() method takes the clustering target data and the number of clusters as arguments. First, it stores the target data in the struct.

    //initialize representative vectors
    //to define the random-number range for initializing the representative
    //points, find the minimum and maximum of each explanatory variable
    transposedData := Transpose(km.data)
    var minMax [][]float64
    for _, d := range transposedData {
        //start min and max at the first value; starting at zero would
        //give a wrong range when all values have the same sign
        min, max := d[0], d[0]
        for _, n := range d {
            min = math.Min(min, n)
            max = math.Max(max, n)
        }
        minMax = append(minMax, []float64{min, max})
    }

kmeans first prepares the k representative points (this k is the number you choose as the number of clusters), and those are then updated from the data. In practice, the initial values of the representative points are very important. This time I simply draw each initial value from a random number whose range spans from the minimum to the maximum of the relevant explanatory variable.

    //set initial values of the representative points
    for i := 0; i < k; i++ {
        km.representatives = append(km.representatives, make([]float64, len(minMax)))
    }
    for i := 0; i < k; i++ {
        for j := 0; j < len(minMax); j++ {
            km.representatives[i][j] = rand.Float64()*(minMax[j][1]-minMax[j][0]) + minMax[j][0]
        }
    }

Here, it initializes the representative points.

    //initialize labels
    //calculate the distance between each data point and each
    //representative point and assign the label of the nearest one
    for _, d := range km.data {
        var distance []float64
        for _, r := range km.representatives {
            distance = append(distance, Dist(d, r))
        }
        km.labels = append(km.labels, ArgMin(distance))
    }

This part calculates the distance between each data point and the representative points, and the label of the nearest representative point is given to the data point. Here, the indices of the representative points are used as the labels.
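
For example, combining the Dist and ArgMin functions defined above by hand (a toy check, not part of the program):

distance := []float64{
    Dist([]float64{0, 0}, []float64{3, 4}), //5 (the 3-4-5 triangle)
    Dist([]float64{0, 0}, []float64{1, 1}), //about 1.41
}
fmt.Println(ArgMin(distance)) //1: the index of the nearest candidate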

    for {
        //update the representative points:
        //the centroid of the data belonging to a representative point
        //becomes the new representative point (index i is the label)
        var tempRepresentatives [][]float64
        for i := range km.representatives {
            var grouped [][]float64
            for j, d := range km.data {
                if km.labels[j] == i {
                    grouped = append(grouped, d)
                }
            }
            if len(grouped) != 0 {
                transposedGroup := Transpose(grouped)
                updated := []float64{}
                for _, vectors := range transposedGroup {
                    value := 0.0
                    for _, v := range vectors {
                        value += v
                    }
                    //store the mean of each explanatory variable
                    updated = append(updated, value/float64(len(vectors)))
                }
                tempRepresentatives = append(tempRepresentatives, updated)
            } else {
                //keep the previous point when a cluster is empty so that
                //labels keep corresponding to representative indices
                tempRepresentatives = append(tempRepresentatives, km.representatives[i])
            }
        }
        km.representatives = tempRepresentatives

        //update labels
        tempLabel := []int{}
        for _, d := range km.data {
            var distance []float64
            for _, r := range km.representatives {
                distance = append(distance, Dist(d, r))
            }
            tempLabel = append(tempLabel, ArgMin(distance))
        }
        if reflect.DeepEqual(km.labels, tempLabel) {
            break
        } else {
            km.labels = tempLabel
        }
    }

This is the main part of the kmeans algorithm. The first half updates the representative points and the latter half updates the data's labels. Concretely, the representative-point update sets the centroid of the data belonging to each representative point as the new representative point, and the label update assigns new labels by recalculating the distance between the data and the representative points.
Usually, the loop is stopped when the state change becomes stable, judged by label changes, by how far the representative points move on each update, by the change in total distance, and so on. This time, to keep the code small, I simply chose the concise criterion of stopping when a representative-point update doesn't change any label. Actually, with this criterion, if the initial representative points are heavily biased, some clusters can hold too much data, a single update may not be enough to change any label, and the training can be stopped too early.
So, to use this algorithm in practice, it is worth thinking harder about how to choose the initial values and when to stop; one common alternative criterion is sketched below.
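
For instance, a minimal sketch of a movement-based stopping criterion; the function converged and the tolerance eps are illustrative choices of mine, with Dist being the Euclidean distance function defined above:

//converged reports whether no representative point moved farther
//than eps (e.g. 1e-6) between two updates
func converged(prev, next [][]float64, eps float64) bool {
    for i := range prev {
        if Dist(prev[i], next[i]) > eps {
            return false
        }
    }
    return true
}

The main loop would then compare km.representatives before and after each update instead of comparing the labels.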