Sunday, April 1, 2018

kNN by Julia from scratch

Abstract

In this article, I'll write a naive kNN algorithm in Julia. I recently started using Julia and need the practice; kNN is a relatively simple algorithm, so it makes a nice exercise.



kNN

kNN (k-nearest neighbors) is simple to state: we calculate the distances between the target point and the training data points, and the labels of the nearest k points decide the prediction by majority vote.
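As a minimal sketch of that idea, here is a toy one-dimensional example (the data is invented just for illustration):

```julia
# Toy kNN idea: the k nearest training points vote on the target's label.
points = [(1.0, "a"), (1.2, "a"), (3.0, "b"), (0.9, "a"), (2.8, "b")]
target = 1.1

# sort the training points by distance to the target
sorted = sort(points, by = p -> abs(p[1] - target))

k = 3
votes = [p[2] for p in sorted[1:k]]
println(votes)  # ["a", "a", "a"] -- all three nearest points are labeled "a"
```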

Code

In this article, I write kNN purely as practice. The code is not meant for a library, so I don't properly handle errors, exceptions and so on, and some parts are probably not idiomatic Julia. I'll just write a kNN implementation that works.
First, we load the necessary packages and define the KNN type.

using DataFrames, CSV, DataStructures
using Random  # for shuffle


# Holds the training features (x) and labels (y).
struct KNN
    x::DataFrames.DataFrame
    y::DataFrames.DataFrame
end

In kNN, we need the distance between two points; the function calcDist() computes the Euclidean distance. Julia can already do this (e.g. norm from the LinearAlgebra standard library), and the function below skips necessary steps such as checking that the two arrays have the same length. But for practice, never mind.

# Euclidean distance between two points.
# (total avoids shadowing the built-in sum function.)
function calcDist(sourcePoint, destPoint)
    total = 0.0
    for i in 1:length(sourcePoint)
        total += (destPoint[i] - sourcePoint[i]) ^ 2
    end
    return sqrt(total)
end
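As noted above, the standard library can compute this directly: the loop is equivalent to the Euclidean norm of the difference vector. A quick sketch using LinearAlgebra:

```julia
using LinearAlgebra

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

# element-wise difference, then the Euclidean norm
d = norm(a .- b)
println(d)  # 5.0
```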

Simple kNN uses the majority rule. The following function counts the frequency of each label and returns the most frequent one.

# Return the most frequent label among the candidates.
function extractTop(targetCandidates)
    targetFrequency = counter(targetCandidates)

    normValue = 0
    normKey = ""  # placeholder until the first label is seen

    for key in keys(targetFrequency)
        if targetFrequency[key] > normValue
            normKey = key
            normValue = targetFrequency[key]
        end
    end
    return normKey
end
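For example, with counter from DataStructures the majority label falls out directly (the labels below are invented for illustration):

```julia
using DataStructures

labels = ["setosa", "versicolor", "setosa", "setosa", "versicolor"]
freq = counter(labels)  # an Accumulator mapping label => count

println(freq["setosa"])      # 3
println(freq["versicolor"])  # 2
# extractTop(labels) would therefore return "setosa"
```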

We also need to split the data into training and test sets. The following function does this.

# Shuffle the row indices and split them at the given ratio.
function splitTrainTest(data, at = 0.7)
    n = nrow(data)
    ind = shuffle(1:n)
    train_ind = view(ind, 1:floor(Int, at*n))
    test_ind = view(ind, (floor(Int, at*n)+1):n)
    return data[train_ind,:], data[test_ind,:]
end
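A quick self-contained check of the split (the function above is reproduced so the snippet runs on its own; the ratio 0.5 is chosen here just to make the expected sizes obvious):

```julia
using DataFrames, Random

# splitTrainTest from the article, reproduced for a self-contained check
function splitTrainTest(data, at = 0.7)
    n = nrow(data)
    ind = shuffle(1:n)
    train_ind = view(ind, 1:floor(Int, at*n))
    test_ind = view(ind, (floor(Int, at*n)+1):n)
    return data[train_ind,:], data[test_ind,:]
end

df = DataFrame(x = 1:10)
train, test = splitTrainTest(df, 0.5)
println(nrow(train), " ", nrow(test))  # 5 5
```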

Usually the kNN algorithm needs no special work in a fit() phase, meaning there is no training step. So the main function is predict().
Here, the distances between each target point and the training data points are calculated, and the labels of the nearest k points decide the target's label by majority rule.
The name of sortperm() is not very intuitive: it sorts the data and returns the permutation of indices that sorts it.
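A small example of sortperm() in action:

```julia
distances = [3.2, 0.5, 1.7]

idx = sortperm(distances)   # indices that would sort the array
println(idx)                # [2, 3, 1]
println(distances[idx])     # [0.5, 1.7, 3.2]
```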

function predict(data::KNN, testData::DataFrames.DataFrame, k=5)
    predictedLabels = []
    for i in 1:size(testData, 1)
        sourcePoint = Array(testData[i,:])
        # distances from this test point to every training point
        distances = []
        for j in 1:size(data.x, 1)
            destPoint = Array(data.x[j,:])
            distance = calcDist(sourcePoint, destPoint)
            push!(distances, distance)
        end
        # indices of the training points, nearest first
        sortedIndex = sortperm(distances)
        # labels of the k nearest neighbors; majority vote decides
        targetCandidates = Array(data.y)[sortedIndex[1:k]]
        predictedLabel = extractTop(targetCandidates)
        push!(predictedLabels, predictedLabel)
    end
    return predictedLabels
end
The main() function executes the kNN. The steps are as follows.
  • read data
  • split the data into train and test
  • predict the labels of the test data by kNN
  • evaluate
For evaluation, I used naive accuracy.
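Naive accuracy is just the fraction of predictions that match the true labels; the loop in main() can also be written with broadcasting. A small sketch with invented labels:

```julia
predicted = ["a", "b", "a", "a"]
actual    = ["a", "b", "b", "a"]

# element-wise comparison gives a Bool vector; its sum counts the matches
accuracy = sum(predicted .== actual) / length(predicted)
println(accuracy)  # 0.75
```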

function main()
    # read data
    df = CSV.read("iris.csv", DataFrame)  # readtable() is deprecated

    # split data into train and test
    trainData, testData = splitTrainTest(df)

    xTrain = trainData[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]]
    yTrain = trainData[:, [:Species]]
    xTest = testData[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]]
    yTest = testData[:, [:Species]]

    # predict the label of test data by kNN
    knn = KNN(xTrain, yTrain)
    predicted = predict(knn, xTest)

    # evaluation
    accurate = 0
    yTestArray = Array(yTest)
    for i in 1:length(predicted)
        if yTestArray[i] == predicted[i]
            accurate += 1
        end
    end
    println(accurate/length(predicted))
end

When I executed the code, the output was 0.9555555555555556. This code leaves many cases uncovered, but as a simple practice, it works.

Whole Code

using DataFrames, CSV, DataStructures
using Random  # for shuffle


struct KNN
    x::DataFrames.DataFrame
    y::DataFrames.DataFrame
end

function predict(data::KNN, testData::DataFrames.DataFrame, k=5)
    predictedLabels = []
    for i in 1:size(testData, 1)
        sourcePoint = Array(testData[i,:])
        distances = []
        for j in 1:size(data.x, 1)
            destPoint = Array(data.x[j,:])
            distance = calcDist(sourcePoint, destPoint)
            push!(distances, distance)
        end
        sortedIndex = sortperm(distances)
        targetCandidates = Array(data.y)[sortedIndex[1:k]]
        predictedLabel = extractTop(targetCandidates)
        push!(predictedLabels, predictedLabel)
    end
    return predictedLabels
end

function calcDist(sourcePoint, destPoint)
    total = 0.0
    for i in 1:length(sourcePoint)
        total += (destPoint[i] - sourcePoint[i]) ^ 2
    end
    return sqrt(total)
end

function extractTop(targetCandidates)
    targetFrequency = counter(targetCandidates)

    normValue = 0
    normKey = ""

    for key in keys(targetFrequency)
        if targetFrequency[key] > normValue
            normKey = key
            normValue = targetFrequency[key]
        end
    end
    return normKey
end

function splitTrainTest(data, at = 0.7)
    n = nrow(data)
    ind = shuffle(1:n)
    train_ind = view(ind, 1:floor(Int, at*n))
    test_ind = view(ind, (floor(Int, at*n)+1):n)
    return data[train_ind,:], data[test_ind,:]
end

function main()
    df = CSV.read("iris.csv", DataFrame)

    trainData, testData = splitTrainTest(df)

    xTrain = trainData[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]]
    yTrain = trainData[:, [:Species]]
    xTest = testData[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]]
    yTest = testData[:, [:Species]]

    knn = KNN(xTrain, yTrain)
    predicted = predict(knn, xTest)

    accurate = 0
    yTestArray = Array(yTest)
    for i in 1:length(predicted)
        if yTestArray[i] == predicted[i]
            accurate += 1
        end
    end
    println(accurate/length(predicted))
end

main()