Friday, June 16, 2017

Basic classification example by logistic regression


Overview

I build a classification model on the freely available wine data set, walking through how to handle it step by step.



Get data

import pandas as pd
import numpy as np
wine_data = pd.read_csv(
'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
header = None)
wine_data.columns = ['Class label', 'Alcohol', 
    'Malic acid', 'Ash', 
    'Alcalinity of ash', 'Magnesium', 
    'Total phenols', 'Flavanoids', 
    'Nonflavanoid phenols', 'Proanthocyanins', 
    'Color intensity', 'Hue', 
    'OD280/OD315 of diluted wines', 'Proline']

Data

The data loaded above has 178 rows and 14 columns. 'Class label' is the target variable; the other columns are the features used to predict the label.
The describe function gives a summary of the data.
wine_data.describe()
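As a quick sanity check on those counts, the same UCI wine data also ships with scikit-learn. Using load_wine here is my shortcut for illustration, not the post's download, so this runs without a network connection:

```python
from sklearn.datasets import load_wine

# the bundled copy of the UCI wine data: 178 samples,
# 13 feature columns (the class label is returned separately as y)
X, y = load_wine(return_X_y=True)
print(X.shape)         # (178, 13)
print(sorted(set(y)))  # [0, 1, 2] -> three classes
```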

Dealing with data

First, checking the data for null values is important.
If there are any, they must be handled by dropping the rows or columns, or by imputing values.
wine_data.isnull().sum()
Every count here is zero, so the wine data has no missing values. Next, check how the class labels are distributed.
wine_data['Class label'].value_counts()


2    71
1    59
3    48
dtype: int64
The class counts are somewhat imbalanced, but not severely.
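Since the imbalance is mild, nothing special is needed here. If it were severe, logistic regression's class_weight='balanced' option would reweight each class inversely to its frequency. A minimal sketch on synthetic imbalanced data (the data and numbers below are illustrative only, not from the wine set):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic 90/10 imbalanced two-class data
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weights samples by n_samples / (n_classes * class_count),
# so the minority class is not drowned out during fitting
lr = LogisticRegression(class_weight='balanced').fit(X, y)
print(lr.score(X, y))
```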
Import the libraries.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
This time I use logistic regression.
With 178 samples, the data set is neither tiny nor large, so I don't use a simple hold-out split. I use k-fold cross-validation instead.
The steps are: standardize the data, build the model with logistic regression, and evaluate it by k-fold cross-validation.
scikit-learn's Pipeline is great here: it packages the preprocessing (in this case, the standardization) together with the model.
pipe_lr = Pipeline([
        ('sc', StandardScaler()),
        ('lr', LogisticRegression())
    ])
scores = cross_val_score(estimator=pipe_lr,
                        X=wine_data.iloc[:, 1:],
                        y=wine_data.iloc[:, 0],
                        cv=10,
                        n_jobs=1)
This assigns the accuracy of each cross-validation fold to an array.
print(scores)
[ 1.          0.94444444  1.          0.94444444  1.          0.94444444
  1.          1.          1.          1.        ]
The average accuracy is:
print(np.mean(scores))
0.983333333333
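It is also worth reporting the spread of the fold scores, not just the mean. Using the ten accuracies printed above:

```python
import numpy as np

# the ten fold accuracies from the cross-validation above
scores = np.array([1., 0.94444444, 1., 0.94444444, 1., 0.94444444,
                   1., 1., 1., 1.])
# report mean and standard deviation across folds together
print('%.3f +/- %.3f' % (scores.mean(), scores.std()))  # 0.983 +/- 0.025
```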
The accuracy of this model (standardize the data, then apply logistic regression) is roughly 0.98.
In a practical situation, we would first try several preprocessing steps, build several models, compare them with each other, and then adopt the best one.
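That comparison step could be sketched like this. The second candidate (a linear SVM) and the use of scikit-learn's bundled copy of the wine data are my choices for illustration, not part of the original post:

```python
from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# evaluate each candidate pipeline under the same 10-fold protocol
candidates = {
    'logistic regression': LogisticRegression(),
    'linear SVM': SVC(kernel='linear'),
}
results = {}
for name, clf in candidates.items():
    pipe = Pipeline([('sc', StandardScaler()), ('clf', clf)])
    results[name] = cross_val_score(pipe, X, y, cv=10).mean()
    print('%s: %.3f' % (name, results[name]))
```

Keeping the standardization inside each pipeline matters: it is refit on every training fold, so no information leaks from the test folds.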
*1
If the data contains nulls, we need to handle them by dropping rows or columns, or by imputing representative values.
# delete null-contained rows
wine_data.dropna()

# delete null-contained columns
wine_data.dropna(axis=1)

# pad null with representative values
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(wine_data)
data = imp.transform(wine_data.values)
The imputation strategy can be chosen ('mean', 'median', or 'most_frequent'); SimpleImputer always imputes column by column.
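A tiny worked example makes the column-wise behavior concrete; SimpleImputer is the current scikit-learn API, and the array below is made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# one missing value per column
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

# 'mean' fills each hole with its column's mean: 2.0 and 6.0 here;
# 'median' and 'most_frequent' are the other common strategies
filled = SimpleImputer(strategy='mean').fit_transform(X)
print(filled)
```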
*2
If the data has categorical variables, we can convert them to dummy variables. pandas's get_dummies function is useful for this.
import pandas as pd

data = pd.Series(['red', 'blue', 'red', 'green'])

data
0      red
1     blue
2      red
3    green
dtype: object
pd.get_dummies(data)
   blue  green  red
0     0      0    1
1     1      0    0
2     0      0    1
3     0      1    0
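One option worth knowing: with drop_first=True, get_dummies drops the first category's column, which removes the redundancy (the dropped column is implied by the others). A small sketch:

```python
import pandas as pd

data = pd.Series(['red', 'blue', 'red', 'green'])

# categories sort alphabetically (blue, green, red); drop_first
# removes 'blue', leaving two columns that still encode all three values
dummies = pd.get_dummies(data, drop_first=True)
print(list(dummies.columns))  # ['green', 'red']
```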