Overview
Here, I introduce how to write kmeans, one of the machine learning algorithms, in Go. I'm fairly new to Go, and writing machine learning algorithms is a coding exercise for me.
Go has some machine learning packages, but I couldn't find information about how to write machine learning algorithms from scratch. So I'll introduce them step by step.
What I write here is not for practical use but for understanding the algorithms by reading and writing them. So I give priority to keeping the code simple, sacrificing accuracy-improving elements when they are not easy to add.
What is kmeans?
kmeans is one of the unsupervised machine learning algorithms, and it separates data into clusters. Its remarkable characteristic is that you can choose the number of clusters the data are separated into. I think this characteristic makes kmeans one of the most accessible clustering algorithms.
Algorithm
The kmeans algorithm defines k representative points and assigns each data point to the nearest representative point. For each representative point, it then calculates the average of the data points assigned to it. The calculated average becomes the new representative point, and the data points are again assigned to their nearest representative points. This cycle is repeated until the state no longer changes.
In a nutshell, it can be summarized as follows.
After defining the initial representative points,
- assign each data point to the nearest representative point
- make the centroid of the assigned data points the new representative point
- repeat those two steps until the state doesn't change
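As a toy illustration: take the one-dimensional data {1, 2, 9, 10} with k = 2 and initial representative points 2 and 9. The points 1 and 2 are nearest to 2, while 9 and 10 are nearest to 9, so the new representative points become the centroids 1.5 and 9.5. Reassigning the data changes no label, so the algorithm stops with the clusters {1, 2} and {9, 10}.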
The algorithm I wrote here
I chose the following approach for this kmeans implementation.
- initialize the representative points with random numbers drawn from a fixed range
- use euclidean distance as the distance function (see the formula below)
- stop when an update doesn't change any label
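For reference, the euclidean distance between two points x and y is d(x, y) = √(Σᵢ (xᵢ − yᵢ)²); this is exactly what the Dist function in the code below computes.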
The code I used
package main
import (
"os"
"encoding/csv"
"io"
"strconv"
"math"
"math/rand"
"reflect"
"fmt"
"time"
)
func main() {
//read data
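//iris.csv is assumed to have four numeric columns followed by a label column, with no header row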
irisMatrix := [][]string{}
iris, err := os.Open("iris.csv")
if err != nil {
panic(err)
}
defer iris.Close()
reader := csv.NewReader(iris)
reader.Comma = ','
reader.LazyQuotes = true
for {
record, err := reader.Read()
if err == io.EOF {
break
} else if err != nil {
panic(err)
}
irisMatrix = append(irisMatrix, record)
}
X := [][]float64{}
Y := []string{}
for _, data := range irisMatrix {
//convert the string slice into a float slice
temp := []float64{}
for _, i := range data[:4] {
parsedValue, err := strconv.ParseFloat(i, 64)
if err != nil {
panic(err)
}
temp = append(temp, parsedValue)
}
//to the explanatory variables (features)
X = append(X, temp)
//to the explained variable (label)
Y = append(Y, data[4])
}
km := Kmeans{}
fmt.Printf("predict:%v\n", km.fit(X, 3))
fmt.Printf("teacher:%v\n", Y)
}
type Kmeans struct {
data [][]float64
labels []int
representatives [][]float64
}
//Transpose converts the rows of a 2D slice into columns
func Transpose(source [][]float64) [][]float64 {
var dest [][]float64
for i := 0; i < len(source[0]); i++ {
var temp []float64
for j := 0; j < len(source); j++ {
temp = append(temp, 0.0)
}
dest = append(dest, temp)
}
for i := 0; i < len(source); i++ {
for j := 0; j < len(source[0]); j++ {
dest[j][i] = source[i][j]
}
}
return dest
}
//Dist calculates the euclidean distance between two points
func Dist(source, dest []float64) float64 {
var dist float64
for i := range source {
dist += math.Pow(source[i]-dest[i], 2)
}
return math.Sqrt(dist)
}
//ArgMin returns the index of the smallest value in the slice
func ArgMin(target []float64) int {
var (
index int
base float64
)
for i, d := range target {
if i == 0 {
index = i
base = d
} else {
if d < base {
index = i
base = d
}
}
}
return index
}
func (km *Kmeans) fit(X [][]float64, k int) []int {
//set random number seeds
rand.Seed(time.Now().UnixNano())
//store data into structure
km.data = X
//initialize representative vectors
//to define the random number range for initializing the representative points, check the minimum and maximum of each explanatory variable
transposedData := Transpose(km.data)
var minMax [][]float64
for _, d := range transposedData {
//initialize min and max with the first value; starting from zero would give a wrong minimum for all-positive data
min := d[0]
max := d[0]
for _, n := range d {
min = math.Min(min, n)
max = math.Max(max, n)
}
minMax = append(minMax, []float64{min, max})
}
//set initial values of representative points
for i := 0; i < k; i++ {
km.representatives = append(km.representatives, make([]float64, len(minMax)))
}
for i := 0; i < k; i++ {
for j := 0; j < len(minMax); j++ {
km.representatives[i][j] = rand.Float64()*(minMax[j][1]-minMax[j][0]) + minMax[j][0]
}
}
//initialize label
//calculate the distance between each data point and each representative point, and assign a label
for _, d := range km.data {
var distance []float64
for _, r := range km.representatives {
distance = append(distance, Dist(d, r))
}
km.labels = append(km.labels, ArgMin(distance))
}
for {
//update representative point
//set the centroid of the data which belong to the representative point as updated representative point
//index i means the label
var tempRepresentatives [][]float64
for i := range km.representatives {
var grouped [][]float64
for j, d := range km.data {
if km.labels[j] == i {
grouped = append(grouped, d)
}
}
//note: a cluster that has lost all its members is dropped here, so the number of clusters can shrink
if len(grouped) != 0 {
transposedGroup := Transpose(grouped)
updated := []float64{}
for _, vectors := range transposedGroup {
value := 0.0
for _, v := range vectors {
value += v
}
//store the mean of each explanatory variable
updated = append(updated, value/float64(len(vectors)))
}
tempRepresentatives = append(tempRepresentatives, updated)
}
}
km.representatives = tempRepresentatives
//update labels
tempLabel := []int{}
for _, d := range km.data {
var distance []float64
for _, r := range km.representatives {
distance = append(distance, Dist(d, r))
}
tempLabel = append(tempLabel, ArgMin(distance))
}
if reflect.DeepEqual(km.labels, tempLabel) {
break
} else {
km.labels = tempLabel
}
}
return km.labels
}
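To try it out, put iris.csv (four numeric columns plus a label column, no header) next to the source file, which I'll assume here is saved as main.go, and run go run main.go. It prints the predicted labels followed by the teacher labels.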
Check one by one
The major part of the code is the fit() method. It's too long to check all at once, so let's split it up and check it piece by piece.
func (km *Kmeans) fit(X [][]float64, k int) []int {
//set random number seeds
rand.Seed(time.Now().UnixNano())
//store data into structure
km.data = X
The fit() method takes the clustering target data and the number of clusters as arguments. First, it stores the target data in the structure.
//initialize representative vectors
//to define the random number range for initializing the representative points, check the minimum and maximum of each explanatory variable
transposedData := Transpose(km.data)
var minMax [][]float64
for _, d := range transposedData {
//initialize min and max with the first value; starting from zero would give a wrong minimum for all-positive data
min := d[0]
max := d[0]
for _, n := range d {
min = math.Min(min, n)
max = math.Max(max, n)
}
minMax = append(minMax, []float64{min, max})
}
kmeans prepares k representative points at first (this k is the number you choose as the number of clusters), and those points are then updated by the data. In practice, the initial values of the representative points matter a lot. This time I simply draw each initial value from a uniform random range between the minimum and maximum of the corresponding explanatory variable.
//set initial values of representative points
for i := 0; i < k; i++ {
km.representatives = append(km.representatives, make([]float64, len(minMax)))
}
for i := 0; i < k; i++ {
for j := 0; j < len(minMax); j++ {
km.representatives[i][j] = rand.Float64()*(minMax[j][1]-minMax[j][0]) + minMax[j][0]
}
}
Here, it initializes the representative points.
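As a side note, a common alternative is to pick k distinct data points at random as the initial representative points (the so-called Forgy method). Here is a minimal sketch under the same [][]float64 data layout; initByForgy is a hypothetical helper, not part of the code above.
func initByForgy(data [][]float64, k int) [][]float64 {
//pick k distinct row indices at random and copy those rows as initial representative points
representatives := make([][]float64, 0, k)
for _, idx := range rand.Perm(len(data))[:k] {
point := make([]float64, len(data[idx]))
copy(point, data[idx])
representatives = append(representatives, point)
}
return representatives
}
Starting from actual data points guarantees every initial representative point lies inside the data's range, which tends to reduce the empty-cluster problem mentioned later.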
//initialize label
//calculate the distance between each data point and each representative point, and assign a label
for _, d := range km.data {
var distance []float64
for _, r := range km.representatives {
distance = append(distance, Dist(d, r))
}
km.labels = append(km.labels, ArgMin(distance))
}
This part calculates the distance between each data point and the representative points, and gives each data point the label of the nearest representative point. Here, the indices of the representative points are used as labels.
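For example, if the distances from one data point to the three representative points are [2.5, 0.8, 3.1], ArgMin returns 1, so that data point gets the label 1.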
for {
//update representative point
//set the centroid of the data which belong to the representative point as updated representative point
//index i means the label
var tempRepresentatives [][]float64
for i := range km.representatives {
var grouped [][]float64
for j, d := range km.data {
if km.labels[j] == i {
grouped = append(grouped, d)
}
}
//note: a cluster that has lost all its members is dropped here, so the number of clusters can shrink
if len(grouped) != 0 {
transposedGroup := Transpose(grouped)
updated := []float64{}
for _, vectors := range transposedGroup {
value := 0.0
for _, v := range vectors {
value += v
}
//store the mean of each explanatory variable
updated = append(updated, value/float64(len(vectors)))
}
tempRepresentatives = append(tempRepresentatives, updated)
}
}
km.representatives = tempRepresentatives
//update labels
tempLabel := []int{}
for _, d := range km.data {
var distance []float64
for _, r := range km.representatives {
distance = append(distance, Dist(d, r))
}
tempLabel = append(tempLabel, ArgMin(distance))
}
if reflect.DeepEqual(km.labels, tempLabel) {
break
} else {
km.labels = tempLabel
}
}
This is the main part of the kmeans algorithm. The first half updates the representative points, and the second half updates the data labels. Concretely, the representative-point update sets the centroid of the data assigned to each representative point as its new position, and the label update assigns each data point the label of its nearest representative point.
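For example, if the points (1, 2) and (3, 4) belong to one representative point, its updated position is the centroid ((1+3)/2, (2+4)/2) = (2, 3).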
Usually, the stopping decision is based on whether the state has become stable: whether labels still change after a representative-point update, how far the representative points moved, and so on. This time I chose the most concise criterion, stopping when a representative-point update doesn't change any label, to keep the code small. But with this criterion, if the initial representative points are badly biased, one cluster can absorb too much of the data, a single update may not be enough to change any label, and training can stop too early.
So, for practical use of this algorithm, it is better to think carefully about how to choose the initial values and when to stop.
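For instance, one sturdier stopping rule is to finish when no representative point moves more than some small eps, with a cap on the number of iterations. A minimal sketch; converged is a hypothetical helper, not part of the code above.
//hypothetical helper: reports whether every representative point moved at most eps
func converged(prev, next [][]float64, eps float64) bool {
if len(prev) != len(next) {
return false
}
for i := range prev {
if Dist(prev[i], next[i]) > eps {
return false
}
}
return true
}
Inside the for loop, you would keep a copy of the previous representative points, break when converged(prev, km.representatives, 1e-6) returns true, and also break once an iteration counter passes some maximum such as 300.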