This post covers methods for processing numeric features so that they take the form expected for data modelling.
import numpy as np
import sklearn.preprocessing as prpr
1 Binarization
1.1 Plain NumPy
X = np.array([[1,-1,2],
[2,0,0],
[0,1,-1]
])
X[X<=1]=0
X[X>1]=1
X
array([[0, 0, 1],
[1, 0, 0],
[0, 0, 0]])
1.2 Binarizer
Binarizer takes a threshold: values greater than the threshold become 1, and all other values become 0.
X = np.array([[1.0,-1,2],
[2,0,0],
[0,1.0,-1]
])
binarizer = prpr.Binarizer(threshold=1.0)
Xa = binarizer.transform(X)
Xa
array([[ 0., 0., 1.],
[ 1., 0., 0.],
[ 0., 0., 0.]])
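The same threshold rule can be sketched directly in NumPy. The helper name `binarize_array` is made up for this illustration. Unlike the in-place assignment in section 1.1, it returns a new array and leaves the input untouched.

```python
import numpy as np

def binarize_array(X, threshold=0.0):
    """Hypothetical helper: 1 where X > threshold, 0 elsewhere (X is not modified)."""
    X = np.asarray(X)
    # The comparison yields a boolean array; casting to int gives 0/1 values.
    return (X > threshold).astype(int)

X = np.array([[1, -1, 2],
              [2, 0, 0],
              [0, 1, -1]])
Xb = binarize_array(X, threshold=1)
print(Xb)
```

Because the comparison is strict, a value equal to the threshold maps to 0, which matches the Binarizer output above.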
2 Rounding
X = np.array([[1.2,-1,2],
[2,0,0],
[0.9,1.9,-1]
])
Xb = np.round(X[:,0])
Xb
array([ 1., 2., 1.])
# The fractional part is gone, but the element dtype of the array is still float
np.array(Xb,dtype='int')
array([1, 2, 1])
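Two details are easy to miss here, so a small sketch with made-up values may help: `np.round` sends exact halves to the nearest even integer, and converting to an integer dtype without rounding first truncates toward zero.

```python
import numpy as np

x = np.array([0.5, 1.5, 2.5, -0.5, 1.2, 1.9])

rounded = np.round(x)          # ties (x.5) go to the nearest even integer
as_int = rounded.astype(int)   # round first, then change the dtype
truncated = x.astype(int)      # dtype change alone drops the fractional part

print(rounded)
print(as_int)
print(truncated)
```

Here 1.9 becomes 2 after rounding but 1 after plain truncation, so the order of the two steps matters.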
3 Interaction
X = np.arange(6).reshape(3,2)
X
array([[0, 1],
[2, 3],
[4, 5]])
polyfeatures = prpr.PolynomialFeatures(degree=2)
Xa = polyfeatures.fit_transform(X)
Xa
array([[ 1., 0., 1., 0., 0., 1.],
[ 1., 2., 3., 4., 6., 9.],
[ 1., 4., 5., 16., 20., 25.]])
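With PolynomialFeatures(degree=2, interaction_only=False, include_bias=True), which are the defaults used above, the output columns for two input features a and b are 1, a, b, a^2, ab, b^2. Setting include_bias=False drops the leading column of ones.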
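To make the column order concrete, the degree-2 expansion can be rebuilt by hand in NumPy. This is only an illustrative sketch, not how scikit-learn implements it; the second array shows the columns that remain when the pure powers are left out, which is what interaction_only=True keeps.

```python
import numpy as np

X = np.arange(6).reshape(3, 2)
a, b = X[:, 0], X[:, 1]

# Full degree-2 expansion: bias, linear terms, squares and the cross term.
full = np.column_stack([np.ones(len(X)), a, b, a**2, a * b, b**2])

# Interaction-only variant: keep the cross term a*b, drop a^2 and b^2.
interaction_only = np.column_stack([np.ones(len(X)), a, b, a * b])

print(full)
print(interaction_only)
```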
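4 Binning (quantization)
When to use it:
- The raw data is skewed: some values occur very frequently while others occur rarely.
- Skewed data easily causes problems during modelling; for example, gradient descent can converge slowly.
Methods:
- fixed-width binning
- adaptive binning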
4.1 fixed-width binning
import pandas as pd
fcc = pd.read_csv('datasets/2016-FCC-New-Coders-Survey-Data.csv',encoding='utf-8')
fcc[['ID.x', 'EmploymentField', 'Age', 'Income']].head()
index | ID.x | EmploymentField | Age | Income |
---|---|---|---|---|
0 | cef35615d61b202f1dc794ef2746df14 | office and administrative support | 28.0 | 32000.0 |
1 | 323e5a113644d18185c743c241407754 | food and beverage | 22.0 | 15000.0 |
2 | b29a1027e5cd062e654a63764157461d | finance | 19.0 | 48000.0 |
3 | 04a11e4bcb573a1261eb0d9948d32637 | arts, entertainment, sports, or media | 26.0 | 43000.0 |
4 | 9368291c93d5d5f5c8cdb1a575e18bec | education | 20.0 | 6000.0 |
Visualize the data:
import matplotlib.pyplot as plt
%matplotlib inline
fig,ax = plt.subplots()
fcc['Age'].hist()
ax.set_title('developer Age histogram')
ax.set_xlabel('Age',fontsize=12)
ax.set_ylabel('frequency',fontsize=12)
Text(0,0.5,'frequency')
- Place the data into bins of equal width
fcc['Age_bin_round'] = np.floor(fcc['Age']/ 10)
fcc[['ID.x', 'EmploymentField', 'Age', 'Income','Age_bin_round']].head()
index | ID.x | EmploymentField | Age | Income | Age_bin_round |
---|---|---|---|---|---|
0 | cef35615d61b202f1dc794ef2746df14 | office and administrative support | 28.0 | 32000.0 | 2.0 |
1 | 323e5a113644d18185c743c241407754 | food and beverage | 22.0 | 15000.0 | 2.0 |
2 | b29a1027e5cd062e654a63764157461d | finance | 19.0 | 48000.0 | 1.0 |
3 | 04a11e4bcb573a1261eb0d9948d32637 | arts, entertainment, sports, or media | 26.0 | 43000.0 | 2.0 |
4 | 9368291c93d5d5f5c8cdb1a575e18bec | education | 20.0 | 6000.0 | 2.0 |
- Place the data into bins of custom, unequal widths
bin_range=[0,15,30,45,60,70,100]
bin_label=[1,2,3,4,5,6]
fcc['Age_bin_range'] = pd.cut(fcc['Age'],bins=bin_range)
fcc['Age_bin_label'] = pd.cut(fcc['Age'],bins=bin_range,labels=bin_label)
fcc[['Age','Age_bin_round','Age_bin_range','Age_bin_label']].head(10)
index | Age | Age_bin_round | Age_bin_range | Age_bin_label |
---|---|---|---|---|
0 | 28.0 | 2.0 | (15, 30] | 2 |
1 | 22.0 | 2.0 | (15, 30] | 2 |
2 | 19.0 | 1.0 | (15, 30] | 2 |
3 | 26.0 | 2.0 | (15, 30] | 2 |
4 | 20.0 | 2.0 | (15, 30] | 2 |
5 | 34.0 | 3.0 | (30, 45] | 3 |
6 | 23.0 | 2.0 | (15, 30] | 2 |
7 | 35.0 | 3.0 | (30, 45] | 3 |
8 | 33.0 | 3.0 | (30, 45] | 3 |
9 | 33.0 | 3.0 | (30, 45] | 3 |
fcc[['Age','Age_bin_round','Age_bin_range','Age_bin_label']].iloc[1071:1076]
index | Age | Age_bin_round | Age_bin_range | Age_bin_label |
---|---|---|---|---|
1071 | 22.0 | 2.0 | (15, 30] | 2 |
1072 | 21.0 | 2.0 | (15, 30] | 2 |
1073 | 40.0 | 4.0 | (30, 45] | 3 |
1074 | 34.0 | 3.0 | (30, 45] | 3 |
1075 | 29.0 | 2.0 | (15, 30] | 2 |
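The interval notation `(15, 30]` in the tables means each bin is open on the left and closed on the right. The sketch below uses a few made-up ages sitting exactly on the bin edges to show what that implies, and how `include_lowest=True` changes the treatment of the left-most edge.

```python
import pandas as pd

ages = pd.Series([0, 15, 16, 30, 31, 100])  # illustrative values on the bin edges
bin_range = [0, 15, 30, 45, 60, 70, 100]
bin_label = [1, 2, 3, 4, 5, 6]

# Default: right-closed intervals, so 0 falls outside (0, 15] and becomes NaN.
default_bins = pd.cut(ages, bins=bin_range, labels=bin_label)

# include_lowest=True also places the left-most edge (0) in the first bin.
with_lowest = pd.cut(ages, bins=bin_range, labels=bin_label, include_lowest=True)

print(pd.DataFrame({'Age': ages,
                    'default': default_bins,
                    'include_lowest': with_lowest}))
```

Ages 15 and 30 land in the lower of their two neighbouring bins, which differs from the `np.floor(Age / 10)` approach, where 30 starts a new bin.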
4.2 adaptive binning
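Fixed-width binning can leave some bins holding a large share of the data while others hold very little or are empty, so it does not solve the skew problem. Adaptive binning is usually a safer choice: it assigns values to bins according to the distribution of the data itself.
Methods:
- 2-quantiles (median)
- 4-quantiles (quartiles)
- 10-quantiles (deciles)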
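A small sketch of the difference, using a made-up skewed sample rather than the survey data: `pd.cut` with an integer `bins` splits the value range into equal widths, while `pd.qcut` splits at quantiles so that each bin receives roughly the same number of values.

```python
import numpy as np
import pandas as pd

# Illustrative skewed sample: eleven small values and one large outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100])

# labels=False returns the integer bin index of every value.
fixed_codes = pd.cut(values, bins=4, labels=False)    # 4 equal-width bins
quantile_codes = pd.qcut(values, q=4, labels=False)   # 4 quartile bins

fixed_counts = np.bincount(fixed_codes, minlength=4)
quantile_counts = np.bincount(quantile_codes, minlength=4)

print('fixed-width counts:', fixed_counts)
print('quantile counts:   ', quantile_counts)
```

With this sample, the fixed-width bins are almost all empty except the first, while the quartile bins each hold three values.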
Visualize the data:
fig,ax = plt.subplots()
fcc['Income'].hist()
ax.set_title('developer income')
ax.set_xlabel('income')
ax.set_ylabel('frequency')
Text(0,0.5,'frequency')
quantile_list=[0,0.25,0.5,0.75,1.0]
quat= fcc['Income'].quantile(quantile_list)
quat
0.00 6000.0
0.25 20000.0
0.50 37000.0
0.75 60000.0
1.00 200000.0
Name: Income, dtype: float64
# Visualize the data with the quantiles marked
fig,ax = plt.subplots()
fcc['Income'].hist()
for q in quat:
    plv = plt.axvline(q,color='r')
ax.legend([plv],['Quantiles'],fontsize=20)
q_labels = ['0-25Q','25-50Q','50-75Q','75-100Q']
fcc['Income_q_range'] = pd.qcut(fcc['Income'],q=quantile_list)
fcc['Income_q_label'] = pd.qcut(fcc['Income'],q=quantile_list,labels=q_labels)
fcc[['Income','Income_q_range','Income_q_label']].head()
index | Income | Income_q_range | Income_q_label |
---|---|---|---|
0 | 32000.0 | (20000.0, 37000.0] | 25-50Q |
1 | 15000.0 | (5999.999, 20000.0] | 0-25Q |
2 | 48000.0 | (37000.0, 60000.0] | 50-75Q |
3 | 43000.0 | (37000.0, 60000.0] | 50-75Q |
4 | 6000.0 | (5999.999, 20000.0] | 0-25Q |
5 Statistical Transformations
These transformations reshape the distribution of a numeric feature so that it is as close as possible to a normal distribution.
5.1 Log Transform
fcc['Income_log'] = np.log(1 + fcc['Income'])
fcc_mean = np.round(np.mean(fcc['Income_log']),2)
fig, ax = plt.subplots()
fcc['Income_log'].hist()
plt.axvline(fcc_mean,color='red')
ax.set_xlabel('Income_log scale')
ax.set_ylabel('frequency')
Text(0,0.5,'frequency')
As the figure above shows, the log-transformed income is close to a normal distribution, but we can do better.
Let's see how to do this with Box-Cox.
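One way to put a number on "close to normal" is the sample skewness, which is 0 for a symmetric distribution. The sketch below uses a handful of made-up income values, not the survey data, and compares the skewness before and after the log transform.

```python
import numpy as np
import pandas as pd

# Illustrative incomes with one large outlier (not taken from the dataset).
income = pd.Series([6000, 15000, 20000, 32000, 37000,
                    43000, 48000, 60000, 200000], dtype=float)

# np.log1p(x) computes log(1 + x), the same transform as used above.
income_log = np.log1p(income)

skew_raw = income.skew()
skew_log = income_log.skew()

print('skewness before:', round(skew_raw, 3))
print('skewness after: ', round(skew_log, 3))
```

A skewness much nearer to 0 after the transform suggests the long right tail has been pulled in.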
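5.2 Box-Cox transform
Constraints:
- The input values must be positive; if the data contains zero or negative values, shift it by a constant first so that every value is positive.
- The transform is controlled by a parameter $\lambda$: $y = (x^{\lambda} - 1)/\lambda$ when $\lambda \neq 0$, and $y = \ln x$ when $\lambda = 0$, so $\lambda = 0$ is the log transform.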
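A minimal NumPy sketch of the transform itself may make the role of $\lambda$ clearer. Both function names are hypothetical helpers written for this illustration; the post itself relies on `scipy.stats.boxcox`.

```python
import numpy as np

def boxcox_manual(x, lmbda):
    """Box-Cox: (x**lmbda - 1) / lmbda for lmbda != 0, log(x) for lmbda == 0."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda

def shift_to_positive(x, eps=1.0):
    """Hypothetical helper: shift x so its minimum equals eps when x has values <= 0."""
    x = np.asarray(x, dtype=float)
    if x.min() <= 0:
        return x - x.min() + eps
    return x

x = np.array([1.0, np.e, np.e ** 2])
print(boxcox_manual(x, 0))       # log transform
print(boxcox_manual(x, 1))       # lambda = 1 only shifts the data by -1
print(boxcox_manual(x, 1e-6))    # a tiny lambda is already close to the log
print(shift_to_positive([-2.0, 0.0, 3.0]))
```

As $\lambda$ approaches 0 the result approaches the log transform, which is why the two cases fit together smoothly.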
income = np.array(fcc['Income'])
income.shape
(15620,)
income_clean = income[~np.isnan(income)]
import scipy.stats as spstats
# Find the optimal lambda
l,opt_lambda = spstats.boxcox(income_clean)
print('optimal lambda value is:',opt_lambda)
optimal lambda value is: 0.117991226621
fcc['Income_lambda_0'] = spstats.boxcox((1+fcc['Income']),lmbda=0)
fcc['Income_lambda_opt'] = spstats.boxcox(fcc['Income'],lmbda=opt_lambda)
Visualize the data:
fig,ax = plt.subplots()
plt.axvline(np.round(np.mean(fcc['Income_lambda_opt']),2),color='red')
fcc['Income_lambda_opt'].hist()
ax.set_xlabel('income opt lambda')
ax.set_ylabel('frequency')
Text(0,0.5,'frequency')