Overview
I implemented kNN (k-nearest neighbors), one of the classic machine learning algorithms, in Go. Go has some machine learning packages, but it is hard to find information on how to write machine learning algorithms in Go yourself. So I will introduce how to do it, step by step. This time: kNN.
This article is meant to explain how the algorithm works, so I leave out refinements that would improve accuracy whenever they would get in the way of understanding the essential points.
What is kNN?
kNN is a supervised learning algorithm. It predicts a data point's label by distance and majority vote.
Algorithm
The kNN algorithm is very simple. To predict a data point's label, it takes the k nearest training points and decides the label by majority vote. Unlike other supervised algorithms, kNN has no concept of weights to update; it just keeps the training data, which consists of explanatory variables and labels. When it predicts a target point, it calculates the distance between the target and every stored point, so the prediction time necessarily grows with the amount of stored data. Spending that much time in the prediction phase can be a serious drawback. On the other hand, kNN consists of nothing but a simple distance calculation and a majority vote, which makes it easy to combine with other algorithms without worrying about many details. That is a huge advantage.
Let's look at a tiny example before diving into the full code.
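Here is a minimal sketch of the idea on made-up one-dimensional data (the points, labels, and target value below are hypothetical, chosen only for illustration):

package main

import (
    "fmt"
    "math"
    "sort"
)

func main() {
    //made-up 1-D training points and their labels
    points := []float64{1.0, 1.5, 3.0, 5.0, 5.5}
    labels := []string{"a", "a", "a", "b", "b"}
    target := 4.5
    k := 3
    //sort the indices by distance to the target
    idx := []int{0, 1, 2, 3, 4}
    sort.Slice(idx, func(i, j int) bool {
        return math.Abs(points[idx[i]]-target) < math.Abs(points[idx[j]]-target)
    })
    //majority vote among the k nearest labels
    votes := map[string]int{}
    for _, i := range idx[:k] {
        votes[labels[i]]++
    }
    fmt.Println(votes) //map[a:1 b:2], so the prediction is "b"
}

The three nearest neighbors of 4.5 are 5.0, 5.5, and 3.0, so the vote is two "b" against one "a", and the predicted label is "b".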
Code
The code I used is as follows.

package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "math"
    "os"
    "sort"
    "strconv"
)
func main() {
    //read data
    irisMatrix := [][]string{}
    iris, err := os.Open("iris.csv")
    if err != nil {
        panic(err)
    }
    defer iris.Close()
    reader := csv.NewReader(iris)
    reader.Comma = ','
    reader.LazyQuotes = true
    for {
        record, err := reader.Read()
        if err == io.EOF {
            break
        } else if err != nil {
            panic(err)
        }
        irisMatrix = append(irisMatrix, record)
    }

    //split data into explanatory and explained variables
    X := [][]float64{}
    Y := []string{}
    for _, data := range irisMatrix {
        //convert string slice data into float slice data
        temp := []float64{}
        for _, i := range data[:4] {
            parsedValue, err := strconv.ParseFloat(i, 64)
            if err != nil {
                panic(err)
            }
            temp = append(temp, parsedValue)
        }
        //explanatory variables
        X = append(X, temp)
        //explained variables
        Y = append(Y, data[4])
    }

    //split data into training and test sets
    var (
        trainX [][]float64
        trainY []string
        testX  [][]float64
        testY  []string
    )
    for i := range X {
        if i%2 == 0 {
            trainX = append(trainX, X[i])
            trainY = append(trainY, Y[i])
        } else {
            testX = append(testX, X[i])
            testY = append(testY, Y[i])
        }
    }

    //training
    knn := KNN{}
    knn.k = 8
    knn.fit(trainX, trainY)
    predicted := knn.predict(testX)

    //check accuracy
    correct := 0
    for i := range predicted {
        if predicted[i] == testY[i] {
            correct++
        }
    }
    fmt.Println(correct)
    fmt.Println(len(predicted))
    fmt.Println(float64(correct) / float64(len(predicted)))
}
//calculate Euclidean distance between two slices
func Dist(source, dest []float64) float64 {
    val := 0.0
    for i := range source {
        val += math.Pow(source[i]-dest[i], 2)
    }
    return math.Sqrt(val)
}
//argsort: sort a slice while keeping track of the original indices
type Slice struct {
    sort.Interface
    idx []int
}

func (s Slice) Swap(i, j int) {
    s.Interface.Swap(i, j)
    s.idx[i], s.idx[j] = s.idx[j], s.idx[i]
}

func NewSlice(n sort.Interface) *Slice {
    s := &Slice{Interface: n, idx: make([]int, n.Len())}
    for i := range s.idx {
        s.idx[i] = i
    }
    return s
}

func NewFloat64Slice(n []float64) *Slice { return NewSlice(sort.Float64Slice(n)) }
//map sort: order label/count pairs by count (descending)
type Entry struct {
    name  string
    value int
}

type List []Entry

func (l List) Len() int {
    return len(l)
}

func (l List) Swap(i, j int) {
    l[i], l[j] = l[j], l[i]
}

func (l List) Less(i, j int) bool {
    if l[i].value == l[j].value {
        return l[i].name < l[j].name
    }
    return l[i].value > l[j].value
}
//count item frequency in a slice
func Counter(target []string) map[string]int {
    counter := map[string]int{}
    for _, elem := range target {
        counter[elem]++
    }
    return counter
}
type KNN struct {
    k      int
    data   [][]float64
    labels []string
}

func (knn *KNN) fit(X [][]float64, Y []string) {
    //store the training data
    knn.data = X
    knn.labels = Y
}
func (knn *KNN) predict(X [][]float64) []string {
    predictedLabel := []string{}
    for _, source := range X {
        var (
            distList   []float64
            nearLabels []string
        )
        //calculate the distance between the prediction target and the stored data
        for _, dest := range knn.data {
            distList = append(distList, Dist(source, dest))
        }
        //take the indices of the k nearest items
        s := NewFloat64Slice(distList)
        sort.Sort(s)
        targetIndex := s.idx[:knn.k]
        //get those indices' labels
        for _, ind := range targetIndex {
            nearLabels = append(nearLabels, knn.labels[ind])
        }
        //get the label frequencies
        labelFreq := Counter(nearLabels)
        //the most frequent label is the predicted label
        a := List{}
        for k, v := range labelFreq {
            a = append(a, Entry{k, v})
        }
        sort.Sort(a)
        predictedLabel = append(predictedLabel, a[0].name)
    }
    return predictedLabel
}
Check one by one
Let's start with the kNN algorithm itself.
type KNN struct {
    k      int
    data   [][]float64
    labels []string
}
The KNN struct has k, data, and labels. k is the number of nearest points the prediction uses. data is the variable that stores the training data, and labels is the variable that stores the training data's labels.
func (knn *KNN) fit(X [][]float64, Y []string) {
    //store the training data
    knn.data = X
    knn.labels = Y
}
This part just stores the data. In other machine learning algorithms, the fit() method usually does the work of updating weights and so on. In kNN, however, the main calculation happens in the prediction phase, so the role of fit() is simply to keep the data.
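One thing worth noting: fit() stores the slice headers rather than copies, so mutating the training slices after fitting would silently change the model's data. If you want the model to own its data, a defensive variant could look like this (a sketch; fitCopy is my own hypothetical name, not part of the original code):

//fitCopy is a defensive variant of fit() that copies the training data
//so later mutation of the caller's slices cannot affect the model
func (knn *KNN) fitCopy(X [][]float64, Y []string) {
    knn.data = make([][]float64, len(X))
    for i, row := range X {
        knn.data[i] = append([]float64(nil), row...)
    }
    knn.labels = append([]string(nil), Y...)
}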
func (knn *KNN) predict(X [][]float64) []string {
    predictedLabel := []string{}
    for _, source := range X {
        var (
            distList   []float64
            nearLabels []string
        )
        //calculate the distance between the prediction target and the stored data
        for _, dest := range knn.data {
            distList = append(distList, Dist(source, dest))
        }
        //take the indices of the k nearest items
        s := NewFloat64Slice(distList)
        sort.Sort(s)
        targetIndex := s.idx[:knn.k]
        //get those indices' labels
        for _, ind := range targetIndex {
            nearLabels = append(nearLabels, knn.labels[ind])
        }
        //get the label frequencies
        labelFreq := Counter(nearLabels)
        //the most frequent label is the predicted label
        a := List{}
        for k, v := range labelFreq {
            a = append(a, Entry{k, v})
        }
        sort.Sort(a)
        predictedLabel = append(predictedLabel, a[0].name)
    }
    return predictedLabel
}
This is the prediction part, where kNN does its main calculation. It is a bit long, so I will break it down into smaller pieces.
//calculate the distance between the prediction target and the stored data
for _, dest := range knn.data {
    distList = append(distList, Dist(source, dest))
}
This stores the distance between the prediction target and each stored data point in distList. The function used here is the following.
//calculate Euclidean distance between two slices
func Dist(source, dest []float64) float64 {
    val := 0.0
    for i := range source {
        val += math.Pow(source[i]-dest[i], 2)
    }
    return math.Sqrt(val)
}
This function takes two slices and returns the distance between them. This time, I used the Euclidean distance.
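The metric is easy to swap out. For example, a Manhattan (L1) distance would be a drop-in replacement for Dist (a sketch, assuming it sits in the same package; math is already imported):

//Manhattan (L1) distance as an alternative metric to Euclidean
func ManhattanDist(source, dest []float64) float64 {
    val := 0.0
    for i := range source {
        val += math.Abs(source[i] - dest[i])
    }
    return val
}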
//take the indices of the k nearest items
s := NewFloat64Slice(distList)
sort.Sort(s)
targetIndex := s.idx[:knn.k]
Here, it sorts the distances and takes the indices of the k nearest points.
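To see what the argsort helper actually does, here is a small check with arbitrary numbers (a sketch, assuming it lives in the same package as Slice and NewFloat64Slice):

//demoArgsort shows that sorting also reorders the idx field,
//which then holds the original positions, nearest first
func demoArgsort() {
    dists := []float64{0.9, 0.1, 0.5}
    s := NewFloat64Slice(dists)
    sort.Sort(s)
    fmt.Println(dists) //[0.1 0.5 0.9]
    fmt.Println(s.idx) //[1 2 0]
}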
//get those indices' labels
for _, ind := range targetIndex {
    nearLabels = append(nearLabels, knn.labels[ind])
}
The labels corresponding to those top k indices are stored in nearLabels.
//get the label frequencies
labelFreq := Counter(nearLabels)
//the most frequent label is the predicted label
a := List{}
for k, v := range labelFreq {
    a = append(a, Entry{k, v})
}
sort.Sort(a)
predictedLabel = append(predictedLabel, a[0].name)
The Counter function builds a map from each element to its frequency. Then, by sorting the map entries, we can take the most frequent label. That label is the prediction.
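For example, with a hypothetical set of k=8 neighbor labels (a sketch, assuming Counter, Entry, and List from the code above are in the same package):

//demoVote shows how Counter and the map sort pick the winning label
func demoVote() {
    nearLabels := []string{"setosa", "setosa", "versicolor", "setosa",
        "versicolor", "setosa", "setosa", "virginica"}
    labelFreq := Counter(nearLabels) //map[setosa:5 versicolor:2 virginica:1]
    a := List{}
    for k, v := range labelFreq {
        a = append(a, Entry{k, v})
    }
    sort.Sort(a)
    fmt.Println(a[0].name) //setosa
}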
Accuracy
In this case, the prediction accuracy on the test data is roughly 0.97.
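If you want to see how sensitive the result is to the choice of k, one simple way is to loop over a few values (a sketch, assuming it replaces the training block inside main above; the accuracies it prints will depend on your data and split):

//try several values of k and report the accuracy for each
for _, k := range []int{1, 3, 5, 8, 15} {
    knn := KNN{k: k}
    knn.fit(trainX, trainY)
    predicted := knn.predict(testX)
    correct := 0
    for i := range predicted {
        if predicted[i] == testY[i] {
            correct++
        }
    }
    fmt.Printf("k=%d accuracy=%.3f\n", k, float64(correct)/float64(len(predicted)))
}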