Sunday, August 13, 2017

kNN in Go from scratch

Overview

I wrote kNN (k-nearest neighbors), one of the classic machine learning algorithms, in Go.
Go has some machine learning packages, but it is hard to find information on how to write machine learning algorithms in Go from scratch. So I will introduce them step by step, starting with kNN.
The goal of this article is to understand how the algorithm works, so I leave out refinements that would improve accuracy whenever they would obscure the essential points.


What is kNN?

kNN is one of the supervised learning algorithms. It predicts a data point's label by distance and majority vote.

Algorithm

The kNN algorithm is very simple. To predict a data point's label, it takes the labels of the k nearest training points and decides by majority vote.
Unlike most other supervised algorithms, kNN has no weights to update. It simply stores the training data, which consists of explanatory variables and labels. To predict a target point, it calculates the distance between the target and every stored point, so prediction time necessarily grows with the amount of stored data. Spending that much time in the prediction phase can be a serious drawback. On the other hand, because kNN is nothing more than a distance calculation plus a majority vote, it is easy to combine with other algorithms without worrying about many details, which is a big advantage.

Let's look at an example.

In the image above, there are three types of circles: red, blue, and green. The red and blue circles are training data read in advance; the colors are the classes they belong to. The green circle is the prediction target. The kNN algorithm first calculates the distances between the green circle and all the other circles, then takes the k nearest ones. Suppose k is 3: the nearest three circles are two blue and one red, so by majority vote the green circle's label is predicted to be blue.
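To make this concrete, here is a minimal, self-contained sketch of the same situation in Go. The coordinates are made up purely for illustration and chosen so that two blue points and one red point end up among the three nearest neighbors of the green target.

package main

import (
    "fmt"
    "math"
    "sort"
)

func main() {
    //hypothetical training points: two classes on a 2D plane
    type point struct {
        label string
        x, y  float64
    }
    points := []point{
        {"blue", 1.0, 1.0},
        {"blue", 1.5, 2.0},
        {"blue", 3.0, 4.0},
        {"red", 2.0, 1.5},
        {"red", 5.0, 5.0},
    }
    target := point{"green", 1.2, 1.4} //the prediction target

    //calculate the distance from the target to every training point
    type neighbor struct {
        label string
        dist  float64
    }
    neighbors := []neighbor{}
    for _, p := range points {
        d := math.Hypot(p.x-target.x, p.y-target.y)
        neighbors = append(neighbors, neighbor{p.label, d})
    }

    //take the k = 3 nearest points and vote
    sort.Slice(neighbors, func(i, j int) bool { return neighbors[i].dist < neighbors[j].dist })
    votes := map[string]int{}
    for _, n := range neighbors[:3] {
        votes[n.label]++
    }
    fmt.Println(votes) //map[blue:2 red:1], so the target is predicted as blue
}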

Code

The code I used is as follows.
package main

import (
    "math"
    "sort"
    "os"
    "encoding/csv"
    "io"
    "strconv"
    "fmt"
)

func main() {
    //read data; iris.csv is expected to contain no header row:
    //four numeric feature columns followed by the species label
    irisMatrix := [][]string{}
    iris, err := os.Open("iris.csv")
    if err != nil {
        panic(err)
    }
    defer iris.Close()

    reader := csv.NewReader(iris)
    reader.Comma = ','
    reader.LazyQuotes = true
    for {
        record, err := reader.Read()
        if err == io.EOF {
            break
        } else if err != nil {
            panic(err)
        }
        irisMatrix = append(irisMatrix, record)
    }

    //split data into explanatory variables (features) and labels
    X := [][]float64{}
    Y := []string{}
    for _, data := range irisMatrix {

        //convert the string slice into a float64 slice
        temp := []float64{}
        for _, i := range data[:4] {
            parsedValue, err := strconv.ParseFloat(i, 64)
            if err != nil {
                panic(err)
            }
            temp = append(temp, parsedValue)
        }
        //explanatory variables
        X = append(X, temp)

        //explained variable (label)
        Y = append(Y, data[4])

    }

    //split data into training and test
    var (
        trainX [][]float64
        trainY []string
        testX  [][]float64
        testY  []string
    )
    for i := range X {
        if i%2 == 0 {
            trainX = append(trainX, X[i])
            trainY = append(trainY, Y[i])
        } else {
            testX = append(testX, X[i])
            testY = append(testY, Y[i])
        }
    }

    //training
    knn := KNN{}
    knn.k = 8
    knn.fit(trainX, trainY)
    predicted := knn.predict(testX)

    //check accuracy
    correct := 0
    for i := range predicted {
        if predicted[i] == testY[i] {
            correct++
        }
    }
    fmt.Println(correct)
    fmt.Println(len(predicted))
    fmt.Println(float64(correct) / float64(len(predicted)))

}

//calculate the Euclidean distance between two slices
func Dist(source, dest []float64) float64 {
    val := 0.0
    for i := range source {
        val += math.Pow(source[i]-dest[i], 2)
    }
    return math.Sqrt(val)
}

//argsort: sort a slice while tracking the original indices
type Slice struct {
    sort.Interface
    idx []int
}

func (s Slice) Swap(i, j int) {
    s.Interface.Swap(i, j)
    s.idx[i], s.idx[j] = s.idx[j], s.idx[i]
}

func NewSlice(n sort.Interface) *Slice {
    s := &Slice{Interface: n, idx: make([]int, n.Len())}
    for i := range s.idx {
        s.idx[i] = i
    }
    return s
}

func NewFloat64Slice(n []float64) *Slice { return NewSlice(sort.Float64Slice(n)) }

//sort map entries by frequency
type Entry struct {
    name  string
    value int
}
type List []Entry

func (l List) Len() int {
    return len(l)
}

func (l List) Swap(i, j int) {
    l[i], l[j] = l[j], l[i]
}

func (l List) Less(i, j int) bool {
    if l[i].value == l[j].value {
        return l[i].name < l[j].name
    } else {
        return l[i].value > l[j].value
    }
}

//count item frequency in a slice
func Counter(target []string) map[string]int {
    counter := map[string]int{}
    for _, elem := range target {
        counter[elem]++
    }
    return counter
}

type KNN struct {
    k      int
    data   [][]float64
    labels []string
}

func (knn *KNN) fit(X [][]float64, Y []string) {
    //store the training data
    knn.data = X
    knn.labels = Y
}

func (knn *KNN) predict(X [][]float64) []string {

    predictedLabel := []string{}
    for _, source := range X {
        var (
            distList   []float64
            nearLabels []string
        )
        //calculate the distance between the prediction target and each stored training point
        for _, dest := range knn.data {
            distList = append(distList, Dist(source, dest))
        }
        //take the indices of the k nearest points
        s := NewFloat64Slice(distList)
        sort.Sort(s)
        targetIndex := s.idx[:knn.k]

        //get the labels at those indices
        for _, ind := range targetIndex {
            nearLabels = append(nearLabels, knn.labels[ind])
        }

        //get label frequency
        labelFreq := Counter(nearLabels)

        //the most frequent label becomes the prediction
        a := List{}
        for k, v := range labelFreq {
            e := Entry{k, v}
            a = append(a, e)
        }
        sort.Sort(a)
        predictedLabel = append(predictedLabel, a[0].name)
    }
    return predictedLabel

}

Check one by one


Let's start with the kNN algorithm part.

type KNN struct {
    k      int
    data   [][]float64
    labels []string
}

The KNN struct has k, data, and labels. k is the number of nearest points used to predict the target, data stores the training data, and labels stores the training data's labels.

func (knn *KNN) fit(X [][]float64, Y []string) {
    //store the training data
    knn.data = X
    knn.labels = Y
}

Here, the data is just stored. In most other machine learning algorithms, the fit() method updates weights and so on. In kNN, however, the main calculation happens in the prediction phase, and the role of fit() is simply to store the data.
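One detail worth noting: fit() stores the slice headers rather than copying the underlying arrays, so the KNN shares memory with the caller's slices. A minimal sketch of the consequence:

    trainX := [][]float64{{1.0, 2.0}}
    trainY := []string{"a"}
    knn := KNN{k: 1}
    knn.fit(trainX, trainY)

    //knn.data shares its backing array with trainX, so mutating
    //the caller's slice also changes the stored training data
    trainX[0][0] = 100.0
    fmt.Println(knn.data[0][0]) //100

If you want the model to be independent of later changes to the input, copy the slices inside fit() instead.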

func (knn *KNN) predict(X [][]float64) []string {

    predictedLabel := []string{}
    for _, source := range X {
        var (
            distList   []float64
            nearLabels []string
        )
        //calculate the distance between the prediction target and each stored training point
        for _, dest := range knn.data {
            distList = append(distList, Dist(source, dest))
        }
        //take the indices of the k nearest points
        s := NewFloat64Slice(distList)
        sort.Sort(s)
        targetIndex := s.idx[:knn.k]

        //get the labels at those indices
        for _, ind := range targetIndex {
            nearLabels = append(nearLabels, knn.labels[ind])
        }

        //get label frequency
        labelFreq := Counter(nearLabels)

        //the most frequent label becomes the prediction
        a := List{}
        for k, v := range labelFreq {
            e := Entry{k, v}
            a = append(a, e)
        }
        sort.Sort(a)
        predictedLabel = append(predictedLabel, a[0].name)
    }
    return predictedLabel

}

This is the prediction part, where kNN does its main calculation.
It is a bit long, so I will break it into smaller parts.

        //calculate the distance between the prediction target and each stored training point
        for _, dest := range knn.data {
            distList = append(distList, Dist(source, dest))
        }

This stores the distance between the prediction target and each stored data point in distList. The function used here is as follows.

//calculate the Euclidean distance between two slices
func Dist(source, dest []float64) float64 {
    val := 0.0
    for i := range source {
        val += math.Pow(source[i]-dest[i], 2)
    }
    return math.Sqrt(val)
}

This function takes two slices and returns the distance between them. This time, I used the Euclidean distance.
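A quick sanity check with the classic 3-4-5 right triangle:

    fmt.Println(Dist([]float64{0, 0}, []float64{3, 4})) //5

Note that since the square root is monotonic, dropping math.Sqrt would not change which neighbors rank nearest; it is kept here because the actual distance is easier to interpret.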

        //take the indices of the k nearest points
        s := NewFloat64Slice(distList)
        sort.Sort(s)
        targetIndex := s.idx[:knn.k]

Here, it sorts the distances and takes the indices of the top k nearest points.
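For example, with a hypothetical distance list:

    s := NewFloat64Slice([]float64{2.5, 0.3, 1.1, 0.7})
    sort.Sort(s)
    //the distances are now sorted in place: [0.3 0.7 1.1 2.5]
    //and s.idx records where each value originally was: [1 3 2 0]
    fmt.Println(s.idx[:3]) //indices of the 3 nearest: [1 3 2]

Slice embeds sort.Interface and overrides only Swap, so every swap the sort performs on the distances is mirrored on idx. That is how we recover which training points the smallest distances belong to.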

        //get the labels at those indices
        for _, ind := range targetIndex {
            nearLabels = append(nearLabels, knn.labels[ind])
        }

The labels corresponding to the top k indices are stored in nearLabels.

        //get label frequency
        labelFreq := Counter(nearLabels)

        //the most frequent label becomes the prediction
        a := List{}
        for k, v := range labelFreq {
            e := Entry{k, v}
            a = append(a, e)
        }
        sort.Sort(a)
        predictedLabel = append(predictedLabel, a[0].name)

The Counter function builds a map from each element to its frequency. Sorting those map entries by frequency then gives the most frequent label, which is the prediction.
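Continuing the two-blue/one-red example from above:

    freq := Counter([]string{"blue", "red", "blue"})
    //freq is map[blue:2 red:1]

    a := List{}
    for k, v := range freq {
        a = append(a, Entry{k, v})
    }
    sort.Sort(a) //List sorts by frequency, descending
    fmt.Println(a[0].name) //blue

The tie-break in Less, which compares names when frequencies are equal, keeps the result deterministic when two labels occur equally often among the k neighbors.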

Accuracy


In this case, the prediction accuracy on the test data is about 0.97.
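If you want to see how the choice of k changes the result, a minimal sketch (reusing trainX, trainY, testX, and testY from main above; the exact numbers will depend on your data and split) could look like this:

    for _, k := range []int{1, 3, 5, 8, 15} {
        knn := KNN{k: k}
        knn.fit(trainX, trainY)
        predicted := knn.predict(testX)
        correct := 0
        for i := range predicted {
            if predicted[i] == testY[i] {
                correct++
            }
        }
        fmt.Printf("k=%2d accuracy=%.3f\n", k, float64(correct)/float64(len(predicted)))
    }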