Overview
For prediction, kNN (k nearest neighbors) is a very slow algorithm, because at prediction time it calculates the distances between the prediction target and every training data point. By adding some processing, I tried to speed up naive kNN and checked how much the time and accuracy change.
Motivation
These days, I have been writing machine learning algorithms from scratch in Go. Although in most cases it is better to use machine learning libraries than to write them yourself, when you write from scratch you can add whatever you want, and those trials deepen your understanding of the algorithms.
Also, machine learning algorithms are sometimes regarded as black boxes that should not be touched. Actually, they are not. I want to show that it is sometimes effective to add some processing to the models.
Experiment Environment
The algorithm I use is kNN, k nearest neighbors. The code is based on kNN by Golang from scratch. In that article, I used the iris data, which has 150 rows and 4 features. This time, to check how long prediction takes, I increased the number of rows by 100 times.
The data size is as follows (a sketch of how the enlarged file can be made is shown after the list).
- rows: 15000
- features: 4
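Here is a minimal sketch of building big_iris.csv by simply repeating the rows of the original iris.csv 100 times. It assumes the original file is named iris.csv and has no header row; it is an illustration, not necessarily the exact script used here.
package main

import (
    "encoding/csv"
    "os"
)

func main() {
    // read the original 150-row iris data (assumed to have no header row)
    in, err := os.Open("iris.csv")
    if err != nil {
        panic(err)
    }
    defer in.Close()
    records, err := csv.NewReader(in).ReadAll()
    if err != nil {
        panic(err)
    }
    // write every row 100 times into big_iris.csv
    out, err := os.Create("big_iris.csv")
    if err != nil {
        panic(err)
    }
    defer out.Close()
    writer := csv.NewWriter(out)
    for i := 0; i < 100; i++ {
        for _, record := range records {
            if err := writer.Write(record); err != nil {
                panic(err)
            }
        }
    }
    writer.Flush()
}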
Policy to speed up
Why is kNN’s prediction slow?
This is because kNN does its main calculation in the prediction phase; in the training phase, it does no calculation to prepare for prediction.
What kNN does in the prediction phase is shown in the image above. The red and blue circles are the training data points that were read beforehand. The green one is the prediction target. Simply put, kNN calculates the distances between the prediction target and the training data read beforehand, and predicts the label by majority rule over the k nearest training points. In the case of this image, if k=3, the nearest 3 circles to the green one are 2 blue circles and 1 red circle, so by majority rule the green circle belongs to the blue class.
As the amount of training data increases, prediction takes more and more time.
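To make the cost concrete, here is a minimal sketch of what naive kNN does for one prediction target. This is an illustration, not the exact code of the original article, and it uses only the standard math and sort packages: it measures the distance to every training point and takes the majority label among the k nearest, so each prediction costs on the order of the whole training set size.
// naivePredictOne is a rough sketch of naive kNN for a single target point.
// It computes the distance to all len(trainX) training points, sorts them,
// and returns the most frequent label among the k nearest.
func naivePredictOne(target []float64, trainX [][]float64, trainY []string, k int) string {
    type neighbor struct {
        dist  float64
        label string
    }
    neighbors := make([]neighbor, len(trainX))
    for i, x := range trainX {
        d := 0.0
        for j := range x {
            d += (x[j] - target[j]) * (x[j] - target[j])
        }
        neighbors[i] = neighbor{dist: math.Sqrt(d), label: trainY[i]}
    }
    sort.Slice(neighbors, func(i, j int) bool { return neighbors[i].dist < neighbors[j].dist })
    // majority vote over the k nearest neighbors
    counts := map[string]int{}
    best, bestCount := "", 0
    for _, n := range neighbors[:k] {
        counts[n.label]++
        if counts[n.label] > bestCount {
            best, bestCount = n.label, counts[n.label]
        }
    }
    return best
}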
Approach
The approach to the speed-up is simple: decrease the number of data points used in the distance calculation phase. Naive kNN calculates all the distances between the prediction target and the training data. By introducing representative points for the training data, I reduced the number of calculations.
Concretely, I used the concept of kmeans. When reading the training data, it runs the kmeans procedure to calculate centroid points for subsets (clusters) of the training data. In the prediction phase, it first calculates the distances to those centroid points and gets the data points that belong to the nearest centroid. Next, it calculates the distances between the prediction target and those data points. The label of the prediction target is then decided by majority rule over the k nearest data points.
For example, in the image above, the light red and blue circles are the data, and the deep blue, yellow, and purple circles are the centroid points calculated in the kmeans phase. First, when you read the training data, you calculate those three centroid points and record which data points belong to each of them. In the prediction phase (here, the green circle is the prediction target), you calculate the distances between the prediction target and each of the centroid points. Here, the nearest centroid point to the green circle is the deep purple circle. Next, you calculate the distances between the green circle and the circles that belong to the deep purple centroid. By majority rule over the nearest k circles, you can judge which label the green circle belongs to. In this case, blue.
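As a rough estimate of the effect, assuming, as in the code below, that the number of clusters is one tenth of the number of training points and that the points spread evenly over the clusters: with 7,500 training rows there are 750 centroids and about 10 points per cluster, so one prediction needs roughly 750 + 10 = 760 distance calculations instead of 7,500.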
Code I used
By adding the kmeans concept to the code from kNN by Golang from scratch, I sped it up.
Concretely, I added kmeans to the fit() method, which originally just read the training data, and made the KNN struct hold the cluster assignments and centroid points. In the prediction phase, it first calculates the distances between the prediction target and those centroid points, and then calculates the distances between the prediction target and the data that belong to the nearest centroid.
The code is as follows.
package main
import (
"math"
"sort"
"os"
"encoding/csv"
"io"
"strconv"
"fmt"
"reflect"
"time"
"math/rand"
)
func main() {
//read data
irisMatrix := [][]string{}
iris, err := os.Open("big_iris.csv")
if err != nil {
panic(err)
}
defer iris.Close()
reader := csv.NewReader(iris)
reader.Comma = ','
reader.LazyQuotes = true
for {
record, err := reader.Read()
if err == io.EOF {
break
} else if err != nil {
panic(err)
}
irisMatrix = append(irisMatrix, record)
}
//split data into explaining and explained variables
X := [][]float64{}
Y := []string{}
for _, data := range irisMatrix {
//convert str slice data into float slice data
temp := []float64{}
for _, i := range data[:4] {
parsedValue, err := strconv.ParseFloat(i, 64)
if err != nil {
panic(err)
}
temp = append(temp, parsedValue)
}
//explaining variables
X = append(X, temp)
//explained variables
Y = append(Y, data[4])
}
//split data into training and test
var (
trainX [][]float64
trainY []string
testX [][]float64
testY []string
)
for i := range X {
if i%2 == 0 {
trainX = append(trainX, X[i])
trainY = append(trainY, Y[i])
} else {
testX = append(testX, X[i])
testY = append(testY, Y[i])
}
}
//training
knn := KNN{}
knn.k = 8
knn.fit(trainX, trainY)
predicted := knn.predict(testX)
//check accuracy
correct := 0
for i := range predicted {
if predicted[i] == testY[i] {
correct += 1
}
}
fmt.Println(correct)
fmt.Println(len(predicted))
fmt.Println(float64(correct) / float64(len(predicted)))
}
func Transpose(source [][]float64) [][]float64 {
var dest [][]float64
for i := 0; i < len(source[0]); i++ {
var temp []float64
for j := 0; j < len(source); j++ {
temp = append(temp, 0.0)
}
dest = append(dest, temp)
}
for i := 0; i < len(source); i++ {
for j := 0; j < len(source[0]); j++ {
dest[j][i] = source[i][j]
}
}
return dest
}
//return the index of the smallest value
func ArgMin(target []float64) int {
var (
index int
base float64
)
for i, d := range target {
if i == 0 {
index = i
base = d
} else {
if d < base {
index = i
base = d
}
}
}
return index
}
//calculate the euclidean distance between two slices
func Dist(source, dest []float64) float64 {
val := 0.0
for i := range source {
val += math.Pow(source[i]-dest[i], 2)
}
return math.Sqrt(val)
}
//argsort: sort a slice while keeping track of the original indices
type Slice struct {
sort.Interface
idx []int
}
func (s Slice) Swap(i, j int) {
s.Interface.Swap(i, j)
s.idx[i], s.idx[j] = s.idx[j], s.idx[i]
}
func NewSlice(n sort.Interface) *Slice {
s := &Slice{Interface: n, idx: make([]int, n.Len())}
for i := range s.idx {
s.idx[i] = i
}
return s
}
func NewFloat64Slice(n []float64) *Slice { return NewSlice(sort.Float64Slice(n)) }
//sort map entries by value in descending order (used for label frequencies)
type Entry struct {
name string
value int
}
type List []Entry
func (l List) Len() int {
return len(l)
}
func (l List) Swap(i, j int) {
l[i], l[j] = l[j], l[i]
}
func (l List) Less(i, j int) bool {
if l[i].value == l[j].value {
return l[i].name < l[j].name
} else {
return l[i].value > l[j].value
}
}
//count the frequency of each item
func Counter(target []string) map[string]int {
counter := map[string]int{}
for _, elem := range target {
counter[elem] += 1
}
return counter
}
type KNN struct {
k int
data [][]float64
labels []string
clusterLabel []int
centroid [][]float64
}
func (knn *KNN) fit(X [][]float64, Y []string) {
//read data
knn.data = X
knn.labels = Y
//assign data to clusters randomly
rand.Seed(time.Now().UnixNano())
//the number of clusters is set to one tenth of the number of training points
clusterNum := len(knn.labels) / 10
for i := 0; i < len(knn.labels); i++ {
knn.clusterLabel = append(knn.clusterLabel, rand.Intn(clusterNum))
}
//calculate the centroid point of each cluster
for i := 0; i < clusterNum; i++ {
var clusterGroup [][]float64
for j, cluster := range knn.clusterLabel {
if cluster == i {
clusterGroup = append(clusterGroup, knn.data[j])
}
}
transposedClusterGroup := Transpose(clusterGroup)
clusterCentroid := []float64{}
for _, values := range transposedClusterGroup {
//get average
mean := 0.0
for _, value := range values {
mean += value
}
clusterCentroid = append(clusterCentroid, mean/float64(len(values)))
}
knn.centroid = append(knn.centroid, clusterCentroid)
}
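//repeat the kmeans steps: recompute each cluster's centroid and reassign every
//training point to its nearest centroid, until the assignment stops changing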
for {
//update centroid points
var tempRepresentatives [][]float64
for i := range knn.centroid {
var grouped [][]float64
for j, d := range knn.data {
if knn.clusterLabel[j] == i {
grouped = append(grouped, d)
}
}
if len(grouped) != 0 {
transposedGroup := Transpose(grouped)
updated := []float64{}
for _, vectors := range transposedGroup {
value := 0.0
for _, v := range vectors {
value += v
}
//store the average of each feature
updated = append(updated, value/float64(len(vectors)))
}
tempRepresentatives = append(tempRepresentatives, updated)
}
}
knn.centroid = tempRepresentatives
//update the cluster label of every training point
tempLabel := []int{}
for _, d := range knn.data {
var distance []float64
for _, r := range knn.centroid {
distance = append(distance, Dist(d, r))
}
tempLabel = append(tempLabel, ArgMin(distance))
}
if reflect.DeepEqual(knn.clusterLabel, tempLabel) {
break
} else {
knn.clusterLabel = tempLabel
}
}
}
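//predict first finds the nearest centroid for each prediction target, then runs
//kNN only against the training points that belong to that centroid's cluster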
func (knn *KNN) predict(X [][]float64) []string {
predictedLabel := []string{}
for _, source := range X {
var (
centDistList []float64
distList []float64
nearLabels []string
)
//finds the nearest centroid point
//calculates the distance between the prediction target and centroid point
for _, dest := range knn.centroid {
centDistList = append(centDistList, Dist(source, dest))
}
//get the cluster index of the nearest centroid point
centS := NewFloat64Slice(centDistList)
sort.Sort(centS)
centroidTargetLabel := centS.idx[0]
//calculates the distance between prediction target and training data
var tempLabel []string
for i, dest := range knn.data {
if knn.clusterLabel[i] == centroidTargetLabel {
distList = append(distList, Dist(source, dest))
tempLabel = append(tempLabel, knn.labels[i])
}
}
//get the indices of the nearest k points (or all points if the cluster holds fewer than k)
s := NewFloat64Slice(distList)
sort.Sort(s)
kNearest := knn.k
if kNearest > len(distList) {
kNearest = len(distList)
}
targetIndex := s.idx[:kNearest]
//get label relevant to the index
for _, ind := range targetIndex {
nearLabels = append(nearLabels, tempLabel[ind])
}
//get label frequency
labelFreq := Counter(nearLabels)
a := List{}
for k, v := range labelFreq {
e := Entry{k, v}
a = append(a, e)
}
sort.Sort(a)
predictedLabel = append(predictedLabel, a[0].name)
}
return predictedLabel
}
Compare time and accuracy
With naive kNN prediction, the time and accuracy are as follows.
- accuracy: 0.96
- time: 17.10s
With the kNN that uses centroid points, they are as follows.
- accuracy: 0.96
- time: 2.46s
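The listing above does not include the timing code itself. As one simple way to measure it (an illustration, not necessarily how the numbers above were taken), the predict() call in main() can be wrapped with time.Now() and time.Since():
start := time.Now()
predicted := knn.predict(testX)
fmt.Printf("prediction took %v\n", time.Since(start))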
Summary
By adding one process, I succeeded in shortening the prediction time of the naive algorithm.
Sometimes, by adding some processing to an existing algorithm, you can get a better outcome (of course, it depends on the goal of the model).
But at other times, if you add an inappropriate process to an algorithm, the accuracy and robustness can decrease.
And you need to keep in mind that the existing machine learning libraries are well made. Even when the goal is specific and well focused, in most cases an existing machine learning library can outperform your own “optimized” algorithm.