Thursday, June 28, 2018

AUC as an objective function: with Julia and Optim.jl package

Abstract

In this article, I'll do AUC optimization on logistic regression. With Julia's Optim package, we can optimize an AUC objective function relatively easily.

Julia: Version 0.6.3




Optim package

The link below is the official page of the Optim.jl package. With this package, we can relatively easily optimize even non-differentiable functions.
Honestly, I haven't read all of the documentation and source code yet, so some parts of the code shown in this article may be improper or wrong. If you find something, please let me know.

Work Flow

The goal of this article is to build a logistic regression model with an AUC objective function for the data about default of credit card clients. I wrote about a simple analysis flow for this data in the articles Simple analysis workflow to data about default of credit card clients and Follow simple analysis workflow with Julia.

Anyway, this article reaches the goal in small steps. I'll follow the steps below.
  1. follow a simple example of the Optim package
  2. make a linear regression with the Optim package
  3. optimize an AUC objective function with the IRIS dataset
  4. optimize an AUC objective function with the data about default of credit card clients

Follow simple example of Optim package

At first, I'll follow the example shown on the official page of Optim.jl. The optimization target is the rosenbrock() function, which takes its minimum value at (x[1], x[2]) = (1.0, 1.0).
The optimize() function takes the objective function, initial values for the parameters, and the optimization method as arguments.

using Optim
rosenbrock(x) =  (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
result = optimize(rosenbrock, zeros(2), NelderMead())
Results of Optimization Algorithm
 * Algorithm: Nelder-Mead
 * Starting Point: [0.0,0.0]
 * Minimizer: [0.9999634355313174,0.9999315506115275]
 * Minimum: 3.525527e-09
 * Iterations: 60
 * Convergence: true
   *  √(Σ(yᵢ-ȳ)²)/n < 1.0e-08: true
   * Reached Maximum Number of Iterations: false
 * Objective Calls: 117

From the output, we can see that the Minimizer is [0.9999634355313174,0.9999315506115275], which is almost [1.0, 1.0].
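Instead of reading the printed summary, we can also pull the results out programmatically. These accessors are part of the Optim.jl API (at least in versions around this era):

# extract pieces of the result object
Optim.minimizer(result)  # ≈ [1.0, 1.0]
Optim.minimum(result)    # ≈ 0.0
Optim.converged(result)  # true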

Make linear regression with Optim package

Let's check how it works with a simple regression case.
Here, we will think about minimizing the following sum of squared errors
(assume there is just one explanatory variable):

$\sum_i \left( y_i - (a x_i + b) \right)^2$

In a practical context, we don't really want the minimum value itself; to say it precisely, we want the parameters $a$ and $b$ that attain it. So, when we let $(w_1, w_2) = (a, b)$, the parameters we want to know are expressed as follows:

$\hat{w} = \arg\min_{w} \sum_i \left( y_i - (w_1 x_i + w_2) \right)^2$

With the Optim package, by defining the objective above, we can get $\hat{w}$. By default, the optimization algorithm is Nelder-Mead.

# regression: fit y = w[1] * x + w[2] by least squares
function objective(w)
    # toy data generated from y = 3x + 10
    x = [1.0, 2.0, 3.0]
    y = 3.0x + 10.0
    # sum of squared residuals
    return sum(abs2.(y - (x * w[1] + w[2])))
end

regOpt = optimize(objective, [0.0, 0.0])
Results of Optimization Algorithm
 * Algorithm: Nelder-Mead
 * Starting Point: [0.0,0.0]
 * Minimizer: [2.9999608123909227,10.000056862432896]
 * Minimum: 4.459737e-09
 * Iterations: 61
 * Convergence: true
   *  √(Σ(yᵢ-ȳ)²)/n < 1.0e-08: true
   * Reached Maximum Number of Iterations: false
 * Objective Calls: 121

In the example above, the data is generated with $y = 3x + 10$. The output, regOpt, shows the Minimizer [2.9999608123909227,10.000056862432896]. It is almost the same as [3.0, 10.0].
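As a sanity check, least squares also has a closed-form solution, which we can get with Julia's backslash operator (this is plain Julia, nothing to do with Optim):

# closed-form least squares with the design matrix [x 1]
x = [1.0, 2.0, 3.0]
y = 3.0x + 10.0
X = hcat(x, ones(3))
X \ y   # ≈ [3.0, 10.0]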

Optimize AUC objective function with IRIS datasets

For the IRIS dataset, I'll make a logistic regression model with an AUC objective function.
At first, from the RDatasets package, we can read the IRIS dataset and prepare it for the model. Note that it is not a good practice to use global variables to store the data, as I do below; in a practical situation, don't use this way.
Anyway, here, I'll set the task of classifying “versicolor” against the others. So, “versicolor” is converted to 1 and the others to 0.

using RDatasets

iris = dataset("datasets", "iris")

standardize(x) = (x - mean(x)) / std(x)

global sepalLength = standardize(Array(iris[:SepalLength]))
global sepalWidth = standardize(Array(iris[:SepalWidth]))
global petalLength = standardize(Array(iris[:PetalLength]))
global petalWidth = standardize(Array(iris[:PetalWidth]))
global species = map(x -> x == "versicolor" ? 1 : 0, Array(iris[:Species]))
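As a quick check on the labels: the IRIS data has 50 rows per species, so we get 50 positive and 100 negative examples.

sum(species)                     # => 50 positives
length(species) - sum(species)   # => 100 negatives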

The process of calculating the AUC score is as follows: for every pair of one positive example and one negative example, check whether the predicted score of the positive one is higher, then divide the number of such correctly ordered pairs by the total number of pairs:

$\mathrm{AUC} = \frac{1}{|P|\,|N|} \sum_{p \in P} \sum_{n \in N} \mathbb{1}\left[ \hat{y}_p > \hat{y}_n \right]$

Naively, I'll write the objective function based on this definition. This AUC objective function is non-differentiable. So, as an optimizer, I'll use NelderMead() here. When no optimization method is given to optimize() as an argument, it adopts NelderMead() as the default.

About AUC objective functions, there are some papers that I think I should read but haven't caught up with yet.
Because of this, the objective function I wrote below just follows the definition of AUC naively.

function objective(w::Array{Float64})

    # logistic regression scores
    logistic(t) = (1 + e^(-t))^(-1)
    t = w[1] + w[2] * sepalLength + w[3] * sepalWidth + w[4] * petalLength + w[5] * petalWidth
    yPred = map(logistic, t)

    positiveIndex = find(1 .== species)
    negativeIndex = find(0 .== species)

    # count the (positive, negative) pairs whose scores are ordered correctly
    function opt()

        s = 0
        for pI in positiveIndex
            for nI in negativeIndex
                if yPred[pI] > yPred[nI]
                    s += 1
                end
            end
        end

        return s
    end

    # negate the count: optimize() minimizes, but we want to maximize it
    return -opt()
end

lr = optimize(objective, [0.0, 0.0, 0.0, 0.0, 0.0])
Results of Optimization Algorithm
 * Algorithm: Nelder-Mead
 * Starting Point: [0.0,0.0,0.0,0.0,0.0]
 * Minimizer: [0.026882886501954997,-0.004107454369796652, ...]
 * Minimum: -4.156000e+03
 * Iterations: 94
 * Convergence: true
   *  √(Σ(yᵢ-ȳ)²)/n < 1.0e-08: true
   * Reached Maximum Number of Iterations: false
 * Objective Calls: 238

The estimated parameters are the Minimizer above. By using these parameters, I'll do prediction. I didn't split the IRIS data into train and test sets, so here I'll use the training data as the test data too. This is not good manner, but since I'm using the IRIS dataset just to check the flow, it doesn't matter here.

w = lr.minimizer
logistic(t) = (1 + e^(-t))^(-1)
t = w[1] + w[2] * sepalLength + w[3] * sepalWidth + w[4] * petalLength + w[5] * petalWidth
yPred = map(logistic, t)

To evaluate the model, we can use the AUC score. To calculate it, I wrote a function and keep it in my own toolbox. If writing it yourself feels annoying, you can use the one from the following link.
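For reference, here is a minimal sketch of such an auc() function, following the pairwise definition above, with ties counted as half a pair. The function actually used below comes from the toolbox linked above:

function auc(labels, scores)
    pos = scores[labels .== 1]
    neg = scores[labels .== 0]
    s = 0.0
    for p in pos, n in neg
        # a tie contributes half of a correctly ordered pair
        s += p > n ? 1.0 : (p == n ? 0.5 : 0.0)
    end
    return s / (length(pos) * length(neg))
end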

include("./MLUtils.jl/src/utils.jl")

@show auc(species, yPred)
auc(species, yPred) = 0.8312

The AUC score is 0.8312. This is consistent with the optimization output above: the Minimum of -4.156000e+03 means 4156 correctly ordered pairs out of 50 × 100 = 5000, and 4156 / 5000 = 0.8312.
Next, we can visualize the outcome and check the behavior.

using PyPlot

positiveIndex = find(1 .== species)
negativeIndex = find(0 .== species)


plt[:hist](yPred[positiveIndex], alpha=0.5, color="blue")
plt[:hist](yPred[negativeIndex], alpha=0.5, color="red")

[Figure: histograms of the predicted scores, positive class in blue, negative class in red]

We can point out two characteristics. First, there are two separate peaks, one for the positive class and one for the negative class. Second, the scores are distributed in a narrow range. The AUC calculation depends only on the ordering of the scores, not their absolute values, so the optimized scores can end up gathered in a narrow area.

Optimize AUC objective function with data about default of credit card clients

For the data about default of credit card clients, I'll try the same thing as with the IRIS dataset. Also, by comparing with the outcome of a logistic regression without the AUC objective function, I'll check the difference.
About the data, please check the link below.
About pre-processing of the data with Julia, if you want to know the details, you can also read the following article.
Roughly, the code below reads the data, omits some columns, and does one-hot encoding.

using DataFrames
dataOrig = readtable("./data/UCI_Credit_Card.csv", header=true)

# omit-target columns
omitTargetLabel = [:ID]

# categorical columns
payLabel = ["PAY_"*string(i) for i in range(0,7) if i != 1]
categoricalLabel = ["SEX", "EDUCATION", "MARRIAGE"]

append!(categoricalLabel, payLabel)

categoricalLabel = Symbol.(categoricalLabel)

# oneHotEncode() is a helper from my own toolbox, not a standard function (a sketch follows below)
oneHotEncoded = oneHotEncode(dataOrig[categoricalLabel])

delete!(dataOrig, categoricalLabel)
delete!(dataOrig, omitTargetLabel)

data = hcat(dataOrig, oneHotEncoded)
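Since oneHotEncode() is my own helper, here is a rough sketch of what it might look like with this era of DataFrames, assuming one 0/1 column per (column, level) pair so that, for example, column PAY_2 with level 2 becomes PAY_2_2:

function oneHotEncode(df::DataFrame)
    encoded = DataFrame()
    for col in names(df)
        for level in unique(df[col])
            # e.g. column PAY_2, level 2 -> new 0/1 column PAY_2_2
            encoded[Symbol(string(col), "_", level)] = Int.(df[col] .== level)
        end
    end
    return encoded
end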

Next, I'll choose some columns, standardize them, and split the data into train and test datasets. Here, I'll use ten explanatory variables and one target variable. The ten explanatory variables are chosen based on the univariate analysis I did in Follow simple analysis workflow with Julia: the top ten variables by AUC score are used here.

using MLDataUtils
using GLM

targetColumns = [:PAY_2_2, :PAY_0_0, :PAY_0_2, :LIMIT_BAL, :PAY_AMT1, :PAY_AMT2, :PAY_3_2, :PAY_2_0, :PAY_AMT3, :PAY_4_2, :default_payment_next_month]
dataColumnLimited = data[targetColumns]

# standardization
for col in targetColumns
    if col == :default_payment_next_month
        continue
    end
    dataColumnLimited[col] = (dataColumnLimited[col] - mean(dataColumnLimited[col]))/std(dataColumnLimited[col])
end

# split data into train and test ones
trainData, testData = splitobs(shuffleobs(dataColumnLimited), at=0.7)

With the GLM package, I'll make an ordinary logistic regression model. In many cases, when we make a logistic regression model, this is the way we follow.

logit = glm(@formula(default_payment_next_month ~ PAY_2_2 + PAY_0_0 + PAY_0_2 + LIMIT_BAL + PAY_AMT1 + PAY_AMT2 + PAY_3_2 + PAY_2_0 + PAY_AMT3 + PAY_4_2), trainData, Binomial(), LogitLink())

Do prediction on the test dataset and check the AUC score.

logisticScore = predict(logit, testData[[:PAY_2_2, :PAY_0_0, :PAY_0_2, :LIMIT_BAL, :PAY_AMT1, :PAY_AMT2, :PAY_3_2, :PAY_2_0, :PAY_AMT3, :PAY_4_2]])
@show auc(Int.(testData[:default_payment_next_month]), Float64.(logisticScore))
auc(Int.(testData[:default_payment_next_month]), Float64.(logisticScore)) = 0.754005194606525

The AUC score is 0.754005194606525.
By visualizing the score distributions, we can check the behavior.

using PyPlot

positiveIndex = find(1 .== testData[:default_payment_next_month])
negativeIndex = find(0 .== testData[:default_payment_next_month])


plt[:hist](logisticScore[positiveIndex], alpha=0.5, color="blue")
plt[:hist](logisticScore[negativeIndex], alpha=0.5, color="red")

[Figure: histograms of the GLM predicted scores, positive class in blue, negative class in red]

Next, I'll make the logistic model with the AUC objective function. Same as before, prepare the explanatory variables. As I wrote, this way is a bad hack because of the global variables.

global pay22 = Array(trainData[:PAY_2_2])
global pay00 = Array(trainData[:PAY_0_0])
global pay02 = Array(trainData[:PAY_0_2])
global limit = Array(trainData[:LIMIT_BAL])
global payamt1 = Array(trainData[:PAY_AMT1])
global payamt2 = Array(trainData[:PAY_AMT2])
global pay32 = Array(trainData[:PAY_3_2])
global pay20 = Array(trainData[:PAY_2_0])
global payamt3 = Array(trainData[:PAY_AMT3])
global pay42 = Array(trainData[:PAY_4_2])
global default = Array(trainData[:default_payment_next_month])

The code below optimizes the AUC objective function. On my PC, it takes quite a while, since every objective call loops over all (positive, negative) pairs of the training data.

function objective(w::Array{Float64})

    # logistic regression scores
    logistic(t) = (1 + e^(-t))^(-1)
    t = w[1] + w[2] * pay22 + w[3] * pay00 + w[4] * pay02 + w[5] * limit + w[6] * payamt1 + w[7] * payamt2 + w[8] * pay32 + w[9] * pay20 + w[10] * payamt3 + w[11] * pay42
    yPred = map(logistic, t)

    positiveIndex = find(1 .== default)
    negativeIndex = find(0 .== default)

    # count the (positive, negative) pairs whose scores are ordered correctly
    function opt()

        s = 0
        for pI in positiveIndex
            for nI in negativeIndex
                if yPred[pI] > yPred[nI]
                    s += 1
                end
            end
        end

        return s
    end

    # negate the count: optimize() minimizes, but we want to maximize it
    return -opt()
end

lr = optimize(objective, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
Results of Optimization Algorithm
 * Algorithm: Nelder-Mead
 * Starting Point: [0.0,0.0, ...]
 * Minimizer: [-0.024256474906223353,0.03245631168471568, ...]
 * Minimum: -5.838430e+07
 * Iterations: 744
 * Convergence: true
   *  √(Σ(yᵢ-ȳ)²)/n < 1.0e-08: true
   * Reached Maximum Number of Iterations: false
 * Objective Calls: 2223
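As noted, the double loop makes each objective call O(|P| × |N|). For reference, the same pair count can be computed in O(n log n) with a rank-based (Mann-Whitney) formulation; here is a minimal sketch, assuming no tied scores:

function paircount(scores, labels)
    # ranks of all scores (1 = smallest)
    ranks = sortperm(sortperm(scores))
    posRanks = ranks[labels .== 1]
    nPos = length(posRanks)
    # the sum of the positive ranks, minus its smallest possible value,
    # equals the number of correctly ordered (positive, negative) pairs
    return sum(posRanks) - nPos * (nPos + 1) ÷ 2
end

Swapping something like this in for opt() would not change the optimum, only the cost of each evaluation.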

By using the parameters, calculate the scores and check the AUC score. Note that the global arrays used below were built from trainData, so this score is computed on the training data; keep that in mind when comparing it with the test-data score of the GLM model above.

w = lr.minimizer
logistic(t) = (1 + e^(-t))^(-1)
t = w[1] + w[2] * pay22 + w[3] * pay00 + w[4] * pay02 + w[5] * limit + w[6] * payamt1 + w[7] * payamt2 + w[8] * pay32 + w[9] * pay20 + w[10] * payamt3 + w[11] * pay42
aucScore = map(logistic, t)

@show auc(Int.(default), aucScore)
auc(Int.(default), aucScore) = 0.7627524306198501
# rebuild the indices from the training labels used above
positiveIndex = find(1 .== default)
negativeIndex = find(0 .== default)

plt[:hist](aucScore[positiveIndex], alpha=0.5, color="blue")
plt[:hist](aucScore[negativeIndex], alpha=0.5, color="red")

[Figure: histograms of the AUC-optimized scores, positive class in blue, negative class in red]
