In Python, scikit-learn can handle missing-value imputation.
I'll use R's built-in airquality dataset to try it.
To prepare the data, run the following code in the R console from your working directory.
write.csv(airquality, "airquality.csv", row.names=FALSE)
In Python, let's try to fill in the missing values with representative values.
import pandas as pd
airquality = pd.read_csv('airquality.csv')
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
As you can see, the air quality data contains missing values (NaN). To check exactly which columns contain them, we can use the isnull() method.
print(airquality.isnull().any())
Ozone       True
Solar.R     True
Wind       False
Temp       False
Month      False
Day        False
dtype: bool
The Ozone and Solar.R columns have missing values.
To focus on handling the missing values, I'll keep only those two columns.
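Beyond a yes/no answer per column, it can help to count the missing entries and look at the affected rows. The following is a minimal sketch, not part of the original walkthrough: `sample` is a hypothetical stand-in built from a few columns of the five rows printed above, so the snippet runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first five rows of airquality shown above.
sample = pd.DataFrame({
    'Ozone':   [41.0, 36.0, 12.0, 18.0, np.nan],
    'Solar.R': [190.0, 118.0, 149.0, 313.0, np.nan],
    'Wind':    [7.4, 8.0, 12.6, 11.5, 14.3],
})

# isnull() marks each missing cell; sum() counts the marks per column.
missing_counts = sample.isnull().sum()
print(missing_counts)

# Select the rows that contain at least one missing value.
rows_with_missing = sample[sample.isnull().any(axis=1)]
print(rows_with_missing)
```

On the real data, the same two calls can be applied to `airquality` directly.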
data = airquality[['Ozone', 'Solar.R']]
print(data.head())
   Ozone  Solar.R
0   41.0    190.0
1   36.0    118.0
2   12.0    149.0
3   18.0    313.0
4    NaN      NaN
The Imputer class of scikit-learn handles this kind of imputation. (Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; in newer versions, sklearn.impute.SimpleImputer takes its place.)
The code below is one example.
Here, each missing value is replaced with the mean of the column it belongs to.
The fit_transform() method can be split into fit() and transform(). fit() learns the required statistics from the data (here, the column means), and transform() fills in the missing values with them. fit_transform() does both in one call.
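from sklearn.preprocessing import Imputer
# Replace each NaN with the mean of its column (axis=0 means column-wise).
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
# fit_transform() returns a NumPy array, so the column names are not kept.
imputed_data = imr.fit_transform(data)
print(imputed_data[:10])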
[[  41.          190.        ]
 [  36.          118.        ]
 [  12.          149.        ]
 [  18.          313.        ]
 [  42.12931034  185.93150685]
 [  28.          185.93150685]
 [  23.          299.        ]
 [  19.           99.        ]
 [   8.           19.        ]
 [  42.12931034  194.        ]]
The missing values were replaced with the means of their columns.
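In scikit-learn 0.22 and later, where Imputer is no longer available, the same idea can be sketched with `sklearn.impute.SimpleImputer`. The snippet calls `fit()` and `transform()` separately to show what each step does. It is an illustration, not the code that produced the output above: `sample` is a hypothetical array holding only the ten rows printed above, so the learned means differ from the full-data means.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical stand-in: the first ten (Ozone, Solar.R) rows shown above.
sample = np.array([
    [41.0, 190.0], [36.0, 118.0], [12.0, 149.0], [18.0, 313.0],
    [np.nan, np.nan], [28.0, np.nan], [23.0, 299.0], [19.0, 99.0],
    [8.0, 19.0], [np.nan, 194.0],
])

# SimpleImputer always works column-wise, so there is no axis parameter.
imr = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit() only learns the per-column means, ignoring NaN entries.
imr.fit(sample)
print(imr.statistics_)

# transform() returns a copy with each NaN replaced by its column's mean.
imputed = imr.transform(sample)
print(imputed)
```

Keeping the two steps apart is useful when the means should be learned on one dataset and then applied to another, for example fitting on training data and transforming test data.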
By changing the strategy parameter, we can choose a different imputation strategy: mean, median, or most_frequent.
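As a rough sketch of how the choice of strategy changes the result, the snippet below compares `mean`, `median`, and `most_frequent` on a tiny column with an outlier. It assumes scikit-learn 0.22 or later and therefore uses `SimpleImputer`; the array `toy` and its values are invented for illustration and do not come from the air quality data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single column: one missing value and one large outlier.
toy = np.array([[1.0], [2.0], [2.0], [np.nan], [100.0]])

fill_values = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imr = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imr.fit_transform(toy)
    # Row 3 held the NaN, so it now holds the value chosen by the strategy.
    fill_values[strategy] = float(filled[3, 0])

print(fill_values)
```

The mean is pulled toward the outlier, while the median and the most frequent value are not, which is one reason to consider them for skewed columns.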
For details, see the official documentation page for sklearn.preprocessing.Imputer.