Data Science Struggle: Follow simple analysis workflow with Julia

Sunday, June 24, 2018

Follow simple analysis workflow with Julia

Abstract

On this article, with Julia I'll roughly reproduce the simple analysis I did on Simple analysis workflow to data about default of credit card clients.
After I wrote that article, I thought to write the following ones. But, I want to follow the same flow with Julia at first. So, I'll do.

Work Flow

The main purpose of this article is to follow the flow of the article below with Julia.

Simple analysis workflow to data about default of credit card clients

About the detail, please check the article above. The steps are roughly as followings. On this article, some steps I'll skip.

check data
pre-processing
make a model with one specific variable
do simple univariate analysis

Code

At first, read the data and check it. With Julia, we can basically do those with same manner as Python.

using DataFrames
using GLM

# load data
dataOrig = readtable("../data/UCI_Credit_Card.csv", header=true)

println("data size: "*string(nrow(dataOrig)))
println("default size: "*string(nrow(dataOrig[dataOrig[:default_payment_next_month].==1,:])))

data size: 30000
default size: 6636

About one-hot-encoding of categorical data and AUC score, I made the functions. Actually, there are already packages to cover those as followings.

Here, with some reasons, I don’t use those and wrote by myself. It is preferable to use existing packages in practical situation. But, this time, don’t care. You can use the functions from my GitHub repository.

https://github.com/marubontan/MLUtils.jl

The code below is to pick up the categorical columns and omit-target column. Also, it does one-hot-encoding. These are pre-processings.

include("../MLUtils.jl/src/utils.jl")

# omit-target columns
omitTargetLabel = [:ID]

# categorical columns
payLabel = ["PAY_"*string(i) for i in range(0,7) if i != 1]
categoricalLabel = ["SEX", "EDUCATION", "MARRIAGE"]

append!(categoricalLabel, payLabel)

categoricalLabel = Symbol.(categoricalLabel)

oneHotEncoded = oneHotEncode(dataOrig[categoricalLabel])

Delete the original categorical columns and omit-target column. And attach the one-hot-encoded columns.

delete!(dataOrig, categoricalLabel)
delete!(dataOrig, omitTargetLabel)

data = hcat(dataOrig, oneHotEncoded)

With splitobs() from MLDataUtils, we can split data into train and test data sets easily in almost same manner as with train_test_split() from sklearn in Python.

using MLDataUtils

trainData, testData = splitobs(shuffleobs(data), at=0.7)

By picking one specific explaining variable, I'll make a model.

# by one specific variable

logit = glm(@formula(default_payment_next_month ~ BILL_AMT4), trainData, Binomial(), LogitLink())

Predict the test data and check AUC score.

predictedScore = predict(logit, testData[[:BILL_AMT4]])
auc(Int.(testData[:default_payment_next_month]), Float64.(predictedScore))

0.5225570248314823

Do simple univariate analysis. The part of the formula looks bit confusing. When we use variable inside the formula, we need to evaluate it with @eval. About this part, you can read the official page below.

https://juliastats.github.io/StatsModels.jl/latest/formula.html


explainingLabels = names(trainData)
aucOutcomes = Dict()
for label in explainingLabels

    if label == :default_payment_next_month
        continue
    end
    formula =@eval @formula(default_payment_next_month ~ $label)
    logit = glm(formula, trainData, Binomial(), LogitLink())

    predictedScore = predict(logit, testData[[label]])
    aucOutcomes[string(label)] = auc(Int.(testData[:default_payment_next_month]), Float64.(predictedScore))
end

With PyPlot, we can visualize the AUC scores.

using PyPlot

labels = []
scores = []
for item in sort(collect(aucOutcomes), by=x->x[2], rev=true)
    push!(labels, item[1])
    push!(scores, item[2])
end

bar(labels, scores)