Tuesday, June 5, 2018

Various ways of writing a neural network with Flux: how to write complex models

Abstract

Flux is one of the deep learning packages for Julia. It is flexible and easy to use, but there are not many examples that show the important points, even though the official documentation and the model zoo help to some extent. So, in this article, I'll write down several types of models and the points where I got stuck. I'm still exploring Flux by reading the source code and by trial and error, so if you find anything strange or any mistakes, please let me know.
In this article, I'll use Julia version 0.6.2.

Exploring

First, for this exploration, I'll prepare three kinds of data: dataA, testA and dataSet.

using Flux

dataA = [1,2,3]
testA = [1.0, 1.0, 1.0]

srand(1234)
dataB = rand(100, 3)
labelB = dataB[:, 1] * 10.0 + dataB[:, 2] * 100.0 + dataB[:, 3] * 1000.0

dataSet = []
for i in 1:length(labelB)
    push!(dataSet, (dataB[i,:], labelB[i]))
end

Before building the models, it is worth checking how a layer works as a function. In Flux, a layer basically takes an input and returns an output, and training updates its weights.

# simple dense layer
denseA = Dense(3,1)
@show denseA(dataA)
denseA(dataA) = param([2.75524])
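
The weights that training will later update can be listed with Flux.params(). Just to see what is there, here is a quick check of my own (the exact numbers depend on the random initialization, so I don't show them; the field names W and b are how the Dense type stores its weights in the Flux version I'm using, as far as I can tell from the source):

# the weight matrix and the bias of denseA, as tracked parameters
@show Flux.params(denseA)
# Dense(3, 1) stores a 1x3 weight matrix and a 1-element bias
@show size(denseA.W)   # (1, 3)
@show size(denseA.b)   # (1,)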

Next, I'll build some models that don't cause any problem. The function Chain() lets us write models concisely. The following code defines three models: modelA, modelB and modelC.
  • modelA: a simple model using the Chain function
  • modelB: a Chain model that contains another Chain model
  • modelC: a Chain model that includes a user-defined layer

These models accept input data and output values computed with their initial weights.

# simple chain
modelA = Chain(
    Dense(3, 5, relu),
    Dense(5, 1, identity)
)
@show modelA(dataA)

# model in model
modelInsideA = Chain(
    Dense(3, 5, relu),
    Dense(5, 3, relu)
)
modelB = Chain(
    modelInsideA,
    Dense(3, 1, identity)
)
@show modelB(dataA)

# use user-defined layer
function OriginalLayer()
    return x -> x
end

modelC = Chain(
    Dense(3, 5),
    OriginalLayer(),
    Dense(5, 1, identity)
)
@show modelC(dataA)
modelA(dataA) = param([-0.251308])
modelB(dataA) = param([-0.196907])
modelC(dataA) = param([7.42981])
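
Since modelB just nests modelInsideA inside another Chain, the inner model's output is fed straight into the outer Dense(3, 1) layer. You can call the inner model on its own to see the intermediate result (a small check of my own; the values depend on the random initialization):

innerOut = modelInsideA(dataA)
@show size(innerOut)   # (3,) -- this 3-element output goes into the outer Dense(3, 1)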

The models above can be trained with a proper training data set. The following code trains each model and then computes its output for the test data.

using Flux: @epochs

lossA(x, y) = Flux.mse(modelA(x), y)
optA = SGD(Flux.params(modelA), 0.00001)
@epochs 100 Flux.train!(lossA, dataSet, optA)

lossB(x, y) = Flux.mse(modelB(x), y)
optB = SGD(Flux.params(modelB), 0.00001)
@epochs 100 Flux.train!(lossB, dataSet, optB)

lossC(x, y) = Flux.mse(modelC(x), y)
optC = SGD(Flux.params(modelC), 0.00001)
@epochs 100 Flux.train!(lossC, dataSet, optC)

# 1110
@show modelA(testA)
@show modelB(testA)
@show modelC(testA)
modelA(testA) = param([1110.01])
modelB(testA) = param([1109.55])
modelC(testA) = param([1109.99])

The output should be around 1110, because testA = [1.0, 1.0, 1.0] gives a true label of 1.0 * 10 + 1.0 * 100 + 1.0 * 1000 = 1110. So we can say these models were trained properly.
Next, I'll write down the points where I got stuck. If you are familiar with Keras, you might think the following models are fine, as I did. The models actually do work in the sense that, as you can see, they accept the input dataA and output calculated values. The phase where I got stuck was the training phase.

# Anti
# functional API like
function modelD(data)
    x = Dense(3, 2)(data)
    x = Dense(2, 1, identity)(x)
    return x
end
@show modelD(dataA)

# diverged type
function modelE(data)
    x = Dense(3, 4)(data)
    xOne = Dense(4, 2)(x)
    xTwo = Dense(4, 3)(x)
    x = vcat(xOne, xTwo)
    x = Dense(5, 1, identity)(x)
    return x
end
@show modelE(dataA)
modelD(dataA) = param([-2.59148])
modelE(dataA) = param([-2.79069])

As an experiment, let's try to train them in the same manner as modelA, modelB and modelC above.

lossD(x, y) = Flux.mse(modelD(x), y)
optD = SGD(Flux.params(modelD), 0.0001)
@epochs 100 Flux.train!(lossD, dataSet, optD)

lossE(x, y) = Flux.mse(modelE(x), y)
optE = SGD(Flux.params(modelE), 0.001)
@epochs 100 Flux.train!(lossE, dataSet, optE)

# 1110
@show modelD(testA)
@show modelE(testA)
modelD(testA) = param([-1.06414])
modelE(testA) = param([-0.461043])

No error occurs, but the prediction for testA is not valid at all, because the weights were never updated. When defining the optimizers, I specified the target weights to update; concretely, this part:

optD = SGD(Flux.params(modelD), 0.0001)

If you want to understand this precisely and properly, it is better to read the source code; here I'll only explain it roughly.
For modelD and modelE, the function Flux.params() doesn't return the target weights. In the code below, Flux.params() returns modelA's weights; on the other hand, it returns nothing for modelD and modelE.

@show Flux.params(modelA)

@show Flux.params(modelD)
@show Flux.params(modelE)
Flux.params(modelA) = Any[param([0.272212 3.11675 31.3104; -0.0440446 0.486544 -0.869422; -0.478121 -0.111971 0.21574; 0.658297 -0.867973 -0.129318; 0.35043 0.610809 4.01329]), param([-0.0940614, 0.608589, -0.105823, -0.228238, 0.335226]), param([31.3904 -0.844373 -0.836109 -0.463083 4.10056]), param([2.11389])]
Flux.params(modelD) = Any[]
Flux.params(modelE) = Any[]
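
The reason is easier to see with a small check of my own (not part of the original experiment): modelD is a plain function, so every call builds brand-new, randomly initialized Dense layers, and there is nothing persistent for Flux.params() to collect or for the optimizer to update.

# each call constructs fresh layers, so the same input gives a different output every time
@show modelD(dataA)
@show modelD(dataA)   # almost surely differs from the line above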

With Flux.params(), we can't get the weight information of a model that is defined as a plain function. One solution is to separate the layer information from the model architecture: a Julia composite type can hold the layer information. The following code is an example, where I re-wrote modelD this way. The ModelD type holds the layer information, and the function Flux.treelike() makes Flux able to collect the parameters inside it. In the architecture part, the function takes the ModelD instance as one of its arguments.

# functional API like
struct ModelD
    w1
    w2
end
Flux.treelike(ModelD)
parametersD = ModelD(Dense(3,2), Dense(2,1,identity))

function modelDFixed(data, parameters::ModelD)
    x = parameters.w1(data)
    x = parameters.w2(x)
    return x
end

lossDFixed(x, y) = Flux.mse(modelDFixed(x, parametersD), y)
optDFixed = SGD(Flux.params(parametersD), 0.0001)
@epochs 100 Flux.train!(lossDFixed, dataSet, optDFixed)


@show modelDFixed(testA, parametersD)
modelDFixed(testA, parametersD) = param([1110.0])
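
Because Flux.treelike(ModelD) tells Flux how to walk into the struct, Flux.params(parametersD) now collects the weights of both Dense layers, which is why the training works this time. A quick sanity check of my own:

# two Dense layers -> two weight matrices plus two bias vectors
@show length(Flux.params(parametersD))   # 4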

modelE can be re-written in the same manner.

# diverged type
struct ModelE
    w1
    w2
    w3
    w4
end
Flux.treelike(ModelE)
parametersE = ModelE(Dense(3,4), Dense(4,2), Dense(4,3), Dense(5,1,identity))

function modelEFixed(data, parameters::ModelE)
    x = parameters.w1(data)
    xOne = parameters.w2(x)
    xTwo = parameters.w3(x)
    x = vcat(xOne, xTwo)
    x = parameters.w4(x)
    return x
end

lossEFixed(x, y) = Flux.mse(modelEFixed(x, parametersE), y)
optEFixed = SGD(Flux.params(parametersE), 0.00001)
@epochs 100 Flux.train!(lossEFixed, dataSet, optEFixed)


@show modelEFixed(testA, parametersE)
modelEFixed(testA, parametersE) = param([1110.0])
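
As a small variation (my own sketch, not something from the runs above), the parameter struct can also be made callable, in the same way the built-in layers are. Then the architecture lives next to the layers and there is no need to pass the struct around explicitly:

# make the ModelD struct itself callable
(m::ModelD)(data) = m.w2(m.w1(data))

# this gives exactly the same result as modelDFixed(testA, parametersD)
@show parametersD(testA)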