Data Science Struggle: CNN + KNN model accuracy

Overview

On contest sites like Kaggle, many high-scoring entries combine several methods.
For example, you can get scores from logistic regression and lasso regression, and then build an xgboost model that uses those scores as features.
This time, I build a CNN model on cifar-10 and then check how a KNN model trained on the CNN's scores performs.

Procedure

Split the data into three parts.
Build an image classification model using the first part as training data.
Predict the second and third parts with that model.
Build a KNN model on the predicted scores of the second part.
Predict the third part with the KNN model.
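
The five steps above can be sketched end to end as follows. This is only a stand-in under stated assumptions: scikit-learn's digits data replaces cifar-10 and LogisticRegression replaces the CNN so that it runs in seconds. The names base, scores_2, scores_test and stacked_accuracy are hypothetical and are not part of the actual experiment.

```python
# Stand-in sketch: digits + LogisticRegression instead of cifar-10 + CNN (assumption).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0  # digits pixels range from 0 to 16

# 1. split the data into three parts
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_1, X_2, y_1, y_2 = train_test_split(X_rest, y_rest, train_size=0.7, random_state=0)

# 2. build the base classifier on the first part only
base = LogisticRegression(max_iter=2000)
base.fit(X_1, y_1)

# 3. predict class scores for the second and third parts
scores_2 = base.predict_proba(X_2)
scores_test = base.predict_proba(X_test)

# 4. build the KNN model on the second part's scores
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(scores_2, y_2)

# 5. predict the third part with the KNN model
knn_pred = knn.predict(scores_test)
stacked_accuracy = accuracy_score(y_test, knn_pred)
print("stacked accuracy:", stacked_accuracy)
```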

The plot below shows the procedure.

Why do I split the data into 3, not 2?

Usually, when we build a model and predict scores, we just split the data into 2. This time, I split it into 3.
In the KNN modeling phase, I use only train_data_2 and test_data. So we want the CNN scores of train_data_2 to have the same characteristics as those of test_data.
If I split the data into just 2, I have no choice but to train the CNN on the train data and then predict both the train data itself and the test data. Those predicted scores could serve as the training data for the KNN model, but their characteristics differ from the test scores, because the CNN has already seen the data it is scoring.
So, to give the KNN's train and test data the same nature, I need to split the data into 3.
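
To make this mismatch concrete, here is a small sketch under assumptions: scikit-learn's digits data and a RandomForestClassifier stand in for cifar-10 and the CNN, and all variable names are hypothetical. It compares the scores a model gives to its own training data with the scores it gives to data it has never seen.

```python
# Stand-in sketch: digits + random forest instead of cifar-10 + CNN (assumption).
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_held, y_train, y_held = train_test_split(
    X, y, train_size=0.7, random_state=0)

base = RandomForestClassifier(n_estimators=50, random_state=0)
base.fit(X_train, y_train)

# Mean confidence: average of the highest class probability per sample.
conf_in_sample = base.predict_proba(X_train).max(axis=1).mean()
conf_held_out = base.predict_proba(X_held).max(axis=1).mean()

# Accuracy on data the model was trained on vs. data it never saw.
acc_in_sample = base.score(X_train, y_train)
acc_held_out = base.score(X_held, y_held)

print("in-sample: confidence", conf_in_sample, "accuracy", acc_in_sample)
print("held-out:  confidence", conf_held_out, "accuracy", acc_held_out)
```

The in-sample scores come out more confident and more accurate than the held-out ones, so a second model fitted on in-sample scores would be trained on a different kind of input than it later receives at test time.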

Data

I use cifar-10, which is color image data with 10 categories.

import numpy as np
import keras
from keras.datasets import cifar10
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten, Activation
from keras.regularizers import l1_l2
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# read data
(x_train_orig, y_train_orig), (x_test, y_test) = cifar10.load_data()

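# split the original training data 70/30; with the provided test set this gives 3 parts
# note: y_test_2 holds the labels of x_train_2 (they become the KNN training labels)
x_train_1, x_train_2, y_train_1, y_test_2 = train_test_split(x_train_orig, y_train_orig, train_size=0.7)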

The code above imports the libraries, loads the data, and splits it into 3 parts: x_train_1, x_train_2, and the test set that cifar-10 provides.
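
The resulting sizes can be checked without downloading anything. The sketch below uses dummy arrays with 50,000 rows, the size of the cifar-10 training set. The random_state and stratify arguments are my additions and are not used in the code above: they would make the split reproducible and keep the class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins with as many rows as the cifar-10 training set (assumption).
x_dummy = np.zeros((50000, 1), dtype=np.uint8)
y_dummy = np.arange(50000) % 10  # 10 balanced classes, 5000 samples each

x_a, x_b, y_a, y_b = train_test_split(
    x_dummy, y_dummy,
    train_size=0.7,
    random_state=0,    # reproducible split
    stratify=y_dummy,  # same class proportions in both parts
)

print(x_a.shape[0], x_b.shape[0])  # rows for the CNN, rows for the KNN
print(np.bincount(y_b))            # samples per class in the second part
```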

CNN modeling

def model_1(x_train, y_train, conv_num, dense_num):
    input_shape = x_train.shape[1:]

    # one-hot encode the labels
    y_train = to_categorical(y_train, 10)

    # set model
    model = Sequential()
    model.add(Conv2D(conv_num, (3,3), activation='relu', input_shape=input_shape))
    model.add(Dropout(0.2))
    model.add(Conv2D(conv_num, (3,3), activation='relu'))
    model.add(Dropout(0.2))
    model.add(MaxPooling2D(pool_size=(2,2)))

    model.add(Conv2D(conv_num * 2, (3,3), activation='relu'))
    model.add(Conv2D(conv_num * 2, (3,3), activation='relu'))
    model.add(Dropout(0.2))
    model.add(MaxPooling2D(pool_size=(2,2)))

    model.add(Flatten())
    # kernel_regularizer is the Keras 2 name of the former W_regularizer argument
    model.add(Dense(dense_num, activation='relu', kernel_regularizer=l1_l2(.01)))
    model.add(Dropout(0.2))
    model.add(Dense(int(dense_num * 0.6), activation='relu', kernel_regularizer=l1_l2(.01)))
    model.add(Dense(10, activation='softmax'))
    model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
    # training; the returned History object also carries the trained model as history.model
    history = model.fit(x_train, y_train, batch_size=256, epochs=50, shuffle=True, validation_split=0.1)
    return history

history_1 = model_1(x_train_1, y_train_1, 32, 256)

The CNN modeling code is above. For CNN modeling in keras itself, check this article. I set 50 training epochs. Actually, this is not enough for the parameters to converge. But the point of this experiment is to check how KNN on the CNN's scores behaves, so here I just set 50 (I am waiting for the day a GPU comes to me from the sky).
Let's check how the training goes with a plot.

import matplotlib.pyplot as plt
def show_history(history):
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    # val_acc comes from validation_split, so it is validation accuracy, not test accuracy
    plt.legend(['train_accuracy', 'validation_accuracy'], loc='best')
    plt.show()

show_history(history_1)

The training is not enough and the validation accuracy is not stable.
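
One caveat that depends on your environment: older Keras versions store the accuracy under 'acc' and 'val_acc', as used above, while newer versions store it under 'accuracy' and 'val_accuracy'. A small helper could pick whichever is present. The name accuracy_keys is hypothetical, not a Keras function.

```python
def accuracy_keys(history_dict):
    """Return (train_key, validation_key) for the naming used in history_dict."""
    if 'accuracy' in history_dict:
        # newer Keras naming
        return 'accuracy', 'val_accuracy'
    # older Keras naming, as used in show_history above
    return 'acc', 'val_acc'


# Usage with plain dictionaries standing in for history.history:
print(accuracy_keys({'loss': [1.2], 'acc': [0.4], 'val_acc': [0.3]}))
print(accuracy_keys({'loss': [1.2], 'accuracy': [0.4], 'val_accuracy': [0.3]}))
```

Inside show_history, the two keys would then come from accuracy_keys(history.history) instead of being hard-coded.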

Predict by CNN model

predictions_1 = history_1.model.predict(x_train_2)
prediction_test = history_1.model.predict(x_test)

‘predictions_1’ is the KNN model’s training data and ‘prediction_test’ is its test data.

KNN model

KNN (k-nearest neighbors classifier) is a simple algorithm. It decides the label of a target by majority vote among the labels of its k nearest items.
In this case, the explanatory variables are the CNN’s scores: 10 values per image, one for each of the 10 cifar-10 categories.
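
To show the idea itself, here is a minimal NumPy sketch of KNN by majority vote. It is not what scikit-learn does internally, only an illustration: the function knn_predict and the toy score vectors are hypothetical, and it assumes Euclidean distance and non-negative integer labels.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k):
    """Label each query row by majority vote among its k nearest training rows."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y).ravel()
    query_x = np.asarray(query_x, dtype=float)
    # squared Euclidean distances, shape (n_query, n_train)
    dists = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    # indices of the k closest training rows for each query row
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_y[nearest]
    # most frequent label per row; ties go to the smallest label
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy score vectors with 2 classes instead of 10 (assumption, for brevity)
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
toy_queries = [[0.85, 0.15], [0.3, 0.7]]

toy_pred = knn_predict(toy_scores, toy_labels, toy_queries, k=3)
print(toy_pred)  # [0 1]
```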

from sklearn.neighbors import KNeighborsClassifier
# make models
knn_2 = KNeighborsClassifier(n_neighbors=2)
knn_4 = KNeighborsClassifier(n_neighbors=4)
knn_8 = KNeighborsClassifier(n_neighbors=8)
knn_16 = KNeighborsClassifier(n_neighbors=16)
knn_32 = KNeighborsClassifier(n_neighbors=32)

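# ravel() turns the (n, 1) label column into the 1-D array scikit-learn expects
knn_2.fit(predictions_1, y_test_2.ravel())
knn_4.fit(predictions_1, y_test_2.ravel())
knn_8.fit(predictions_1, y_test_2.ravel())
knn_16.fit(predictions_1, y_test_2.ravel())
knn_32.fit(predictions_1, y_test_2.ravel())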

# predict
kn_2_pr = knn_2.predict(prediction_test)
kn_4_pr = knn_4.predict(prediction_test)
kn_8_pr = knn_8.predict(prediction_test)
kn_16_pr = knn_16.predict(prediction_test)
kn_32_pr = knn_32.predict(prediction_test)

The code above builds 5 models and makes predictions with each of them.
Let’s check their accuracies.

from sklearn.metrics import accuracy_score

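# pair each k with its predictions instead of building code strings for eval
for k, pred in [(2, kn_2_pr), (4, kn_4_pr), (8, kn_8_pr), (16, kn_16_pr), (32, kn_32_pr)]:
    print("k={}:{}".format(k, accuracy_score(y_test.ravel(), pred)))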

The outcome is like this.

k=2:0.447
k=4:0.5049
k=8:0.5302
k=16:0.5492
k=32:0.5599

Under the following conditions, some of the KNN models are more accurate than the CNN alone.

data: cifar-10
the CNN is not trained for enough epochs
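
For that comparison, the CNN's own accuracy can be computed from the same score matrix by taking the highest-scoring category per image. A minimal sketch, assuming a hypothetical helper accuracy_from_scores and toy values in place of the real scores:

```python
import numpy as np

def accuracy_from_scores(scores, labels):
    """Accuracy when each row is assigned the class with the highest score."""
    predicted = np.argmax(scores, axis=1)
    return float(np.mean(predicted == np.asarray(labels).ravel()))

# Toy values (assumption): 4 samples, 3 classes, labels as an (n, 1) column
toy_scores = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3],
                       [0.3, 0.3, 0.4],
                       [0.5, 0.4, 0.1]])
toy_labels = np.array([[0], [1], [1], [0]])

toy_accuracy = accuracy_from_scores(toy_scores, toy_labels)
print(toy_accuracy)  # 0.75
```

With the variables from this experiment, the call would be accuracy_from_scores(prediction_test, y_test).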

In another article, I’ll train the CNN more and check the KNN accuracy again.


Tuesday, June 27, 2017
