Exploring Kuzushiji-MNIST
Overview
Kuzushiji-MNIST is an MNIST-like dataset based on classical Japanese characters. The following image shows part of the dataset. As you can see, it is composed of visually complex characters.
In this article, I'll give a simple introduction to Kuzushiji-MNIST and do classification with a Keras model.
License
“KMNIST Dataset” (created by CODH), adapted from “Kuzushiji Dataset” (created by NIJL and others), doi:10.20676/00000341
Data
You can download the dataset from https://github.com/rois-codh/kmnist. In this post, I'll use the NumPy-formatted data. (I didn't know about this way of saving data.) To run the following exploration, you need to download the four NumPy files, kmnist-train-imgs.npz, kmnist-train-labels.npz, kmnist-test-imgs.npz, and kmnist-test-labels.npz, into the same directory.
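If you prefer to fetch the files from a script, something like the following works. The base URL follows the pattern used by the repository's download script at the time of writing, so check the README if the links have moved:
import os
import urllib.request

# Assumed base URL; verify against the repository README
BASE_URL = 'http://codh.rois.ac.jp/kmnist/dataset/kmnist/'
FILES = ['kmnist-train-imgs.npz', 'kmnist-train-labels.npz',
         'kmnist-test-imgs.npz', 'kmnist-test-labels.npz']

for name in FILES:
    if not os.path.exists(name):  # skip files that already exist locally
        urllib.request.urlretrieve(BASE_URL + name, name)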
Exploring
To explore the dataset, we load the NumPy-formatted data. This is easy with the numpy.load() function.
import numpy as np
DATA_PATH = '/Your/Data/Directory'
train_data = np.load(DATA_PATH + '/kmnist-train-imgs.npz')['arr_0']
train_label = np.load(DATA_PATH + '/kmnist-train-labels.npz')['arr_0']
test_data = np.load(DATA_PATH + '/kmnist-test-imgs.npz')['arr_0']
test_label = np.load(DATA_PATH + '/kmnist-test-labels.npz')['arr_0']
If you check the data shapes, you will notice that they are the same as MNIST's.
print("train_data: {}".format(train_data.shape))
print("train_label: {}".format(train_label.shape))
print("test_data: {}".format(test_data.shape))
print("test_label: {}".format(test_label.shape))
train_data: (60000, 28, 28)
train_label: (60000,)
test_data: (10000, 28, 28)
test_label: (10000,)
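Besides the shapes, a quick sanity check of the pixel values and the class balance is worth doing. The following should confirm that the images are 8-bit grayscale and that the ten classes are roughly balanced:
# Pixel type and value range (expected: uint8 in [0, 255], as with MNIST)
print(train_data.dtype, train_data.min(), train_data.max())

# Number of training examples per class
labels, counts = np.unique(train_label, return_counts=True)
print(dict(zip(labels, counts)))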
This is a grayscale image dataset, so let's plot one image and take a look.
import matplotlib.pyplot as plt
plt.imshow(train_data[0])
plt.gray()
plt.show()
Apparently, this is a character and you can read it… no way. Although I'm Japanese, I can't read this character. This dataset is composed of cursive Japanese characters, and even for Japanese readers it is hard to recognize them all.
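By the way, the repository also provides kmnist_classmap.csv, which maps each of the ten labels to its modern hiragana form. If you put that file in the same directory, you can at least look up what the character above is supposed to be. A minimal sketch, assuming the file's columns are (index, codepoint, char); adjust the parsing if the layout differs:
import csv

# Build {label: character}; assumes columns (index, codepoint, char) with a header row
with open(DATA_PATH + '/kmnist_classmap.csv', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    classmap = {int(row[0]): row[2] for row in reader}

print(train_label[0], classmap[train_label[0]])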
By plotting some more examples, you can see the visual complexity of the characters.
fig, axes = plt.subplots(5, 5, sharex=True, sharey=True)
for i, data in enumerate(train_data[:25]):
    row, column = divmod(i, 5)
    axes[row, column].imshow(data)
plt.subplots_adjust(wspace=0, hspace=0)
plt.show()
At first, I thought I would do dimensionality reduction, clustering, and so on. But I decided to just do a simple classification with a deep neural network in Keras. I'll leave the further exploration for another article… maybe.
Anyway, the deep neural network model is very simple: an input layer, two hidden layers, and an output layer.
from keras.layers import Dense, Input
from keras.layers.normalization import BatchNormalization
from keras.models import Model
from keras.utils import to_categorical
import keras

def simple_model(x_train, y_train):
    # Input, batch normalization, two hidden layers, softmax output
    inputs = Input(shape=(784,))
    x = BatchNormalization()(inputs)
    x = Dense(512, activation='relu')(x)
    x = Dense(256, activation='relu')(x)
    predictions = Dense(10, activation='softmax')(x)
    model = Model(inputs=inputs, outputs=predictions)

    epochs = 50
    learning_rate = 1e-3
    decay_rate = learning_rate / epochs
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=keras.optimizers.SGD(lr=learning_rate, momentum=0.8, decay=decay_rate),
                  metrics=['acc'])

    # Flatten the 28x28 images to 784-dimensional vectors and one-hot encode the labels
    data_shape = x_train.shape
    history = model.fit(x_train.reshape(data_shape[0], data_shape[1] * data_shape[2]),
                        to_categorical(y_train), epochs=epochs, shuffle=True, validation_split=0.3)
    return history
After training, you can see that the validation accuracy is around 95%.
simple_history = simple_model(train_data, train_label)
Train on 42000 samples, validate on 18000 samples
Epoch 1/50
42000/42000 [==============================] - 9s 215us/step - loss: 0.8146 - acc: 0.7566 - val_loss: 0.4969 - val_acc: 0.8526
Epoch 2/50
42000/42000 [==============================] - 8s 199us/step - loss: 0.4448 - acc: 0.8654 - val_loss: 0.3850 - val_acc: 0.8845
..........................
..........................
Epoch 49/50
42000/42000 [==============================] - 9s 206us/step - loss: 0.0329 - acc: 0.9932 - val_loss: 0.1672 - val_acc: 0.9528
Epoch 50/50
42000/42000 [==============================] - 9s 206us/step - loss: 0.0328 - acc: 0.9933 - val_loss: 0.1671 - val_acc: 0.9521
By visualizing the training history, we can check how the training went.
def show_history(history):
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train_accuracy', 'validation_accuracy'], loc='best')
    plt.show()

show_history(simple_history)
Finally, I'll evaluate the model on the test data. The accuracy is around 90%. The model is very rough this time, so this was to be expected.
reshaped_test_data = test_data.reshape(test_data.shape[0], 28 * 28)
simple_history.model.evaluate(reshaped_test_data, to_categorical(test_label))[1]
0.8953
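The overall accuracy hides which characters are actually hard. As a final check, here is a small NumPy-only sketch that breaks the test accuracy down per class, reusing reshaped_test_data from above:
# Predicted class for each test image
pred = simple_history.model.predict(reshaped_test_data).argmax(axis=1)

# Per-class accuracy over the ten labels
for c in range(10):
    mask = (test_label == c)
    print('class {}: {:.3f}'.format(c, (pred[mask] == c).mean()))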
At first glance, Kuzushiji-MNIST looks quite odd. But by following the same approach as for MNIST, we can build a simple classification model; it is not a strange dataset at all. I think it is a good dataset for simple experiments.