Sunday, August 20, 2017

Speeding up naive kNN with the concept of kmeans

Overview

kNN (k-nearest neighbors) is a very slow algorithm at prediction time, because it calculates all the distances between the prediction target and the training data points in the prediction phase.
By adding some extra processing, I tried to speed up naive kNN and checked how much the time and accuracy change.


Motivation


These days, I have been writing machine learning algorithms from scratch in Go. Although in most cases it is better to use machine learning libraries than to write your own, writing from scratch lets you add whatever you want, and those trials deepen your understanding of the algorithms.
In some cases, machine learning algorithms are treated as untouchable black boxes. But actually they are not. I want to show that it is sometimes effective to add some extra processing to the models.

Experiment Environment


The algorithm I use is kNN, k-nearest neighbors. The code is from kNN by Golang from scratch. In that article, I used the iris data set, which has 150 rows and 4 features. This time, to check how long prediction takes, I increased the number of rows by 100 times.
The data size is as follows.
  • rows: 15000
  • features: 4
I used half of it as training data and the other half as test data. In a nutshell, the goal is to speed up prediction for the 7500 test data points.
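The code below reads the enlarged data from big_iris.csv. The original article does not show how that file was built, but one simple way is to repeat the rows of the 150-row iris file 100 times; the file names in this sketch are assumptions.

package main

import (
    "io/ioutil"
    "strings"
)

func main() {
    //read the original 150-row iris data (the file name is an assumption)
    raw, err := ioutil.ReadFile("iris.csv")
    if err != nil {
        panic(err)
    }
    rows := strings.Split(strings.TrimSpace(string(raw)), "\n")

    //repeat the rows 100 times to get 15000 rows
    var out []string
    for i := 0; i < 100; i++ {
        out = append(out, rows...)
    }
    if err := ioutil.WriteFile("big_iris.csv", []byte(strings.Join(out, "\n")+"\n"), 0644); err != nil {
        panic(err)
    }
}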

Policy to speed up

Why is kNN's prediction slow?


It is because kNN does its main calculation in the prediction phase; the fit phase does almost nothing to prepare for prediction.





What kNN does in the prediction phase is shown in the image above. The red and blue circles are the training data points that were read beforehand. The green one is the prediction target. Simply put, kNN calculates the distance between the prediction target and every training data point, and predicts the label by majority rule over the k nearest training points. In the case of this image, if k=3, the nearest 3 circles from the green one are 2 blue circles and 1 red circle, so by majority rule the green circle is judged to belong to the blue class.
As the amount of training data increases, prediction takes more and more time.
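To make the cost concrete, here is a minimal, self-contained sketch of the naive prediction step (this is not the code used in this article, just an illustration): every prediction has to touch every training point once.

package main

import (
    "fmt"
    "math"
    "sort"
)

//naivePredict is a sketch of naive kNN prediction for one target point:
//it computes the distance to every training point, sorts them,
//and takes the majority label of the k nearest points.
func naivePredict(target []float64, trainX [][]float64, trainY []string, k int) string {
    type neighbor struct {
        dist  float64
        label string
    }
    neighbors := make([]neighbor, len(trainX))
    for i, x := range trainX {
        d := 0.0
        for j := range x {
            d += math.Pow(target[j]-x[j], 2)
        }
        neighbors[i] = neighbor{math.Sqrt(d), trainY[i]}
    }
    //every training point was touched once, so one prediction costs O(len(trainX))
    sort.Slice(neighbors, func(i, j int) bool { return neighbors[i].dist < neighbors[j].dist })

    //majority vote over the k nearest neighbors
    votes := map[string]int{}
    for _, n := range neighbors[:k] {
        votes[n.label]++
    }
    best, bestCount := "", -1
    for label, count := range votes {
        if count > bestCount {
            best, bestCount = label, count
        }
    }
    return best
}

func main() {
    trainX := [][]float64{{0, 0}, {0, 1}, {5, 5}, {5, 6}}
    trainY := []string{"blue", "blue", "red", "red"}
    fmt.Println(naivePredict([]float64{0.5, 0.5}, trainX, trainY, 3)) //blue
}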

Approach


The approach for the speed-up is simple. I just decrease the number of data points used in the distance-calculation phase. Naive kNN calculates all the distances between the prediction target and the training data. By introducing representative points for the training data, I decreased the number of calculations.
Concretely, I used the concept of kmeans. In the phase of reading the training data, a kmeans-style procedure calculates centroid points for subsets of the training data. In the prediction phase, it first calculates the distances to those centroid points and selects the data points that belong to the nearest centroid. In the next step, it calculates the distances between the prediction target and only those data points. The label of the prediction target is then decided by majority rule over the k nearest of them. In the code below, the number of clusters is one tenth of the training data size, so one prediction needs distances to about 750 centroids plus the roughly 10 points of one cluster, instead of to all 7500 training points.



For example, in the case of the image above, the light red and blue circles are the data, and the deep blue, yellow, and purple circles are the centroid points calculated in the kmeans phase. First, when you read the training data, you calculate those three centroid points and record which data points belong to each centroid. In the prediction phase (in this case, the green circle is the prediction target), you calculate the distances between the prediction target and each of the centroid points. Here, the nearest centroid point to the green circle is the deep purple one. Next, you calculate the distances between the green circle and the circles that belong to the deep purple centroid. Finally, by majority rule over the nearest k circles, you can judge which label the green circle belongs to. In this case, blue.
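Condensed into a short sketch, the candidate-selection part of prediction looks like the function below. The full implementation follows in the next section; this sketch assumes the centroids and the cluster assignment of each training point have already been computed, reuses the Dist helper defined in the full code, and uses function and parameter names that are only for illustration.

//twoStepLookup sketches the first half of the modified prediction:
//find the nearest centroid, then keep only the training points of that cluster.
//The usual k-nearest majority vote is then run on the returned candidates.
func twoStepLookup(target []float64, centroids [][]float64, clusterOf []int) []int {
    //step 1: distance to each centroid, keep the nearest one
    nearest := 0
    best := math.MaxFloat64
    for i, c := range centroids {
        if d := Dist(target, c); d < best {
            nearest, best = i, d
        }
    }
    //step 2: collect the indices of the training points assigned to that cluster
    var candidates []int
    for i, cl := range clusterOf {
        if cl == nearest {
            candidates = append(candidates, i)
        }
    }
    return candidates
}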

Code I used


By adding the kmeans concept to the code from kNN by Golang from scratch, I sped it up.
Concretely, I added kmeans to the fit() method, which originally just read the training data, and made the KNN struct hold the cluster assignment and the centroid points. In the prediction phase, it first calculates the distances between the prediction target and those centroid points, and in the next step it calculates the distances between the prediction target and the data points that belong to the nearest centroid.
The code is as follows.

package main

import (
    "math"
    "sort"
    "os"
    "encoding/csv"
    "io"
    "strconv"
    "fmt"
    "reflect"
    "time"
    "math/rand"
)

func main() {
    //read data
    irisMatrix := [][]string{}
    iris, err := os.Open("big_iris.csv")
    if err != nil {
        panic(err)
    }
    defer iris.Close()

    reader := csv.NewReader(iris)
    reader.Comma = ','
    reader.LazyQuotes = true
    for {
        record, err := reader.Read()
        if err == io.EOF {
            break
        } else if err != nil {
            panic(err)
        }
        irisMatrix = append(irisMatrix, record)
    }

    //split data into explanatory and target variables
    X := [][]float64{}
    Y := []string{}
    for _, data := range irisMatrix {

        //convert string values into float64
        temp := []float64{}
        for _, i := range data[:4] {
            parsedValue, err := strconv.ParseFloat(i, 64)
            if err != nil {
                panic(err)
            }
            temp = append(temp, parsedValue)
        }
        //explanatory variables
        X = append(X, temp)

        //target variable
        Y = append(Y, data[4])

    }

    //split data into training and test
    var (
        trainX [][]float64
        trainY []string
        testX  [][]float64
        testY  []string
    )
    for i := range X {
        if i%2 == 0 {
            trainX = append(trainX, X[i])
            trainY = append(trainY, Y[i])
        } else {
            testX = append(testX, X[i])
            testY = append(testY, Y[i])
        }
    }

    //training
    knn := KNN{}
    knn.k = 8
    knn.fit(trainX, trainY)
    predicted := knn.predict(testX)

    //check accuracy
    correct := 0
    for i := range predicted {
        if predicted[i] == testY[i] {
            correct += 1
        }
    }
    fmt.Println(correct)
    fmt.Println(len(predicted))
    fmt.Println(float64(correct) / float64(len(predicted)))

}
func Transpose(source [][]float64) [][]float64 {
    var dest [][]float64
    for i := 0; i < len(source[0]); i++ {
        var temp []float64
        for j := 0; j < len(source); j++ {
            temp = append(temp, 0.0)
        }
        dest = append(dest, temp)
    }

    for i := 0; i < len(source); i++ {
        for j := 0; j < len(source[0]); j++ {
            dest[j][i] = source[i][j]
        }
    }
    return dest
}

//return the index of the smallest value
func ArgMin(target []float64) int {
    var (
        index int
        base  float64
    )
    for i, d := range target {
        if i == 0 {
            index = i
            base = d
        } else {
            if d < base {
                index = i
                base = d
            }
        }

    }
    return index
}

//calculate the euclidean distance between two slices
func Dist(source, dest []float64) float64 {
    val := 0.0
    for i := range source {
        val += math.Pow(source[i]-dest[i], 2)
    }
    return math.Sqrt(val)
}

//argsort: sort while keeping track of the original indices
type Slice struct {
    sort.Interface
    idx []int
}

func (s Slice) Swap(i, j int) {
    s.Interface.Swap(i, j)
    s.idx[i], s.idx[j] = s.idx[j], s.idx[i]
}

func NewSlice(n sort.Interface) *Slice {
    s := &Slice{Interface: n, idx: make([]int, n.Len())}
    for i := range s.idx {
        s.idx[i] = i
    }
    return s
}

func NewFloat64Slice(n []float64) *Slice { return NewSlice(sort.Float64Slice(n)) }

//sort label counts by frequency (descending)
type Entry struct {
    name  string
    value int
}
type List []Entry

func (l List) Len() int {
    return len(l)
}

func (l List) Swap(i, j int) {
    l[i], l[j] = l[j], l[i]
}

func (l List) Less(i, j int) bool {
    if l[i].value == l[j].value {
        return l[i].name < l[j].name
    } else {
        return l[i].value > l[j].value
    }
}

//Count the frequency of each item
func Counter(target []string) map[string]int {
    counter := map[string]int{}
    for _, elem := range target {
        counter[elem] += 1
    }
    return counter
}

type KNN struct {
    k            int
    data         [][]float64
    labels       []string
    clusterLabel []int
    centroid     [][]float64
}

func (knn *KNN) fit(X [][]float64, Y []string) {
    //read data
    knn.data = X
    knn.labels = Y
    //assign data points to clusters randomly
    rand.Seed(time.Now().UnixNano())
    //the number of clusters: one cluster per ten training points
    numClusters := len(knn.labels) / 10
    for i := 0; i < len(knn.labels); i++ {
        knn.clusterLabel = append(knn.clusterLabel, rand.Intn(numClusters))
    }
    //calculate the centroid point of each cluster
    for i := 0; i < numClusters; i++ {
        var clusterGroup [][]float64
        for j, cluster := range knn.clusterLabel {
            if cluster == i {
                clusterGroup = append(clusterGroup, knn.data[j])
            }
        }
        transposedClusterGroup := Transpose(clusterGroup)
        clusterCentroid := []float64{}
        for _, values := range transposedClusterGroup {
            //get average
            mean := 0.0
            for _, value := range values {
                mean += value
            }
            clusterCentroid = append(clusterCentroid, mean/float64(len(values)))

        }
        knn.centroid = append(knn.centroid, clusterCentroid)
    }

    for {
        //update centroid points
        var tempRepresentatives [][]float64
        for i := range knn.centroid {
            var grouped [][]float64
            for j, d := range knn.data {
                if knn.clusterLabel[j] == i {
                    grouped = append(grouped, d)
                }
            }
            if len(grouped) != 0 {

                transposedGroup := Transpose(grouped)
                updated := []float64{}
                for _, vectors := range transposedGroup {

                    value := 0.0
                    for _, v := range vectors {
                        value += v
                    }
                    //store the average of each feature
                    updated = append(updated, value/float64(len(vectors)))
                }
                tempRepresentatives = append(tempRepresentatives, updated)
            }
        }
        knn.centroid = tempRepresentatives

        //update the cluster label of each data point
        tempLabel := []int{}
        for _, d := range knn.data {
            var distance []float64
            for _, r := range knn.centroid {
                distance = append(distance, Dist(d, r))
            }
            tempLabel = append(tempLabel, ArgMin(distance))
        }
        if reflect.DeepEqual(knn.clusterLabel, tempLabel) {
            break
        } else {
            knn.clusterLabel = tempLabel
        }
    }
}

func (knn *KNN) predict(X [][]float64) []string {

    predictedLabel := []string{}
    for _, source := range X {
        var (
            centDistList []float64
            distList     []float64
            nearLabels   []string
        )
        //finds the nearest centroid point
        //calculates the distance between the prediction target and centroid point
        for _, dest := range knn.centroid {
            centDistList = append(centDistList, Dist(source, dest))
        }
        //get the index of the nearest centroid point
        centS := NewFloat64Slice(centDistList)
        sort.Sort(centS)
        centroidTargetLabel := centS.idx[0]

        //calculate the distances between the prediction target and the training data in that cluster
        var tempLabel []string
        for i, dest := range knn.data {
            if knn.clusterLabel[i] == centroidTargetLabel {
                distList = append(distList, Dist(source, dest))
                tempLabel = append(tempLabel, knn.labels[i])
            }
        }
        //get the indices of the nearest k points
        //(if the selected cluster holds fewer than k points, use all of them)
        s := NewFloat64Slice(distList)
        sort.Sort(s)
        kNearest := knn.k
        if len(s.idx) < kNearest {
            kNearest = len(s.idx)
        }
        targetIndex := s.idx[:kNearest]

        //get the labels at those indices
        for _, ind := range targetIndex {
            nearLabels = append(nearLabels, tempLabel[ind])
        }

        //get label frequency
        labelFreq := Counter(nearLabels)


        a := List{}
        for k, v := range labelFreq {
            e := Entry{k, v}
            a = append(a, e)
        }
        sort.Sort(a)
        predictedLabel = append(predictedLabel, a[0].name)
    }
    return predictedLabel

}

Comparing time and accuracy


With naive kNN prediction, the time and accuracy are as follows.
  • accuracy: 0.96
  • time: 17.10s
The mixture of kmeans and kNN gives the following.
  • accuracy: 0.96
  • time: 2.46s
Without changing the accuracy, the prediction time was shortened to roughly one seventh.
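For reference, one simple way to measure the prediction time (the post does not show how it was timed, so this is just an assumption) is to wrap the predict() call in main() with the time package, which is already imported:

    //measure how long prediction takes (replaces the plain predict call in main above)
    start := time.Now()
    predicted := knn.predict(testX)
    fmt.Printf("prediction took %v\n", time.Since(start))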

Summary


By adding one extra step, I succeeded in shortening the prediction time of the naive algorithm.
Sometimes, by adding some processing to an existing algorithm, you can get a better outcome (of course, it depends on the goal of the model).
But at other times, if you add an inappropriate process to an algorithm, the accuracy and robustness can decrease.
And you need to keep in mind that the existing machine learning libraries are decent. Even when the goal is specific and well focused, in most cases existing machine learning libraries can outperform your own “optimized” algorithm.