1 二值法
- 1.1 noraml
- 1.2 Binarizer
2 取整(rounding)
3 iteraction
4 Binning（quantization）量化
- 4.1 fixed-width binning
- 4.2 adaptive binning
5 Statistical Transformations
- 5.1 Log Transform
- 5.2 box-cox transform

对数值类型的特征进行处理的方法，从而得到进行数据建模期望的特征值。

import numpy as np
import sklearn.preprocessing as prpr

1 二值法

1.1 noraml

X = np.array([[1,-1,2],
            [2,0,0],
             [0,1,-1]
             ])
X[X<=1]=0
X[X>1]=1

array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 0]])

1.2 Binarizer

Binarizer需要指定一个阈值threshold,当大于这个阈值时为1，否则为0

X = np.array([[1.0,-1,2],
            [2,0,0],
             [0,1.0,-1]
             ])
binarizer = prpr.Binarizer(threshold=1.0)
Xa = binarizer.transform(X)
Xa

array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])

2 取整(rounding)

X = np.array([[1.2,-1,2],
            [2,0,0],
             [0.9,1.9,-1]
             ])
Xb = np.round(X[:,0])
Xb

array([ 1.,  2.,  1.])

# 可以发现虽然去掉了小数点，但数组的元素类型依然为浮点型
np.array(Xb,dtype='int')

array([1, 2, 1])

3 iteraction

X = np.arange(6).reshape(3,2)
X

array([[0, 1],
       [2, 3],
       [4, 5]])

polyfeatures  = prpr.PolynomialFeatures(degree=2)
Xa = polyfeatures.fit_transform(X)

Xa

array([[  1.,   0.,   1.,   0.,   0.,   1.],
       [  1.,   2.,   3.,   4.,   6.,   9.],
       [  1.,   4.,   5.,  16.,  20.,  25.]])

PolynomialFeatures(degree=2,interaction_only=False, include_bias=False),生成的结果为 1, a,b,a^2,ab,b^2

4 Binning（quantization）量化

使用场景：

原始数据发生倾斜：即有些数据发生频率很高，有些却极少发生
倾斜的数据很容易造成建模时发生问题，例如，梯度下降有时会很慢

方法：

fixed-width binning
adaptive binning

4.1 fixed-width binning

import pandas as pd

fcc = pd.read_csv('datasets/2016-FCC-New-Coders-Survey-Data.csv',encoding='utf-8')

C:\Users\liuwu\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2698: DtypeWarning: Columns (21,57) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

fcc[['ID.x', 'EmploymentField', 'Age', 'Income']].head()

	ID.x	EmploymentField	Age	Income
0	cef35615d61b202f1dc794ef2746df14	office and administrative support	28.0	32000.0
1	323e5a113644d18185c743c241407754	food and beverage	22.0	15000.0
2	b29a1027e5cd062e654a63764157461d	finance	19.0	48000.0
3	04a11e4bcb573a1261eb0d9948d32637	arts, entertainment, sports, or media	26.0	43000.0
4	9368291c93d5d5f5c8cdb1a575e18bec	education	20.0	6000.0

visualize the data

import matplotlib.pyplot as plt
%matplotlib inline

fig,ax = plt.subplots()
fcc['Age'].hist()
ax.set_title('developer Age histogram')
ax.set_xlabel('Age',fontsize=12)
ax.set_ylabel('frequency',fontsize=12)

Text(0,0.5,'frequency')

png

按照宽度均等的要求将数据放入不同的bin中

fcc['Age_bin_round'] = np.floor(fcc['Age']/ 10)
fcc[['ID.x', 'EmploymentField', 'Age', 'Income','Age_bin_round']].head()

	ID.x	EmploymentField	Age	Income	Age_bin_round
0	cef35615d61b202f1dc794ef2746df14	office and administrative support	28.0	32000.0	2.0
1	323e5a113644d18185c743c241407754	food and beverage	22.0	15000.0	2.0
2	b29a1027e5cd062e654a63764157461d	finance	19.0	48000.0	1.0
3	04a11e4bcb573a1261eb0d9948d32637	arts, entertainment, sports, or media	26.0	43000.0	2.0
4	9368291c93d5d5f5c8cdb1a575e18bec	education	20.0	6000.0	2.0

按照不同的宽度要求将数据放入到不同的bin中

bin_range=[0,15,30,45,60,70,100]
bin_label=[1,2,3,4,5,6]

fcc['Age_bin_range'] = pd.cut(fcc['Age'],bins=bin_range)

fcc['Age_bin_label'] = pd.cut(fcc['Age'],bins=bin_range,labels=bin_label)

fcc[['Age','Age_bin_round','Age_bin_range','Age_bin_label']].head(10)

	Age	Age_bin_round	Age_bin_range	Age_bin_label
0	28.0	2.0	(15, 30]	2
1	22.0	2.0	(15, 30]	2
2	19.0	1.0	(15, 30]	2
3	26.0	2.0	(15, 30]	2
4	20.0	2.0	(15, 30]	2
5	34.0	3.0	(30, 45]	3
6	23.0	2.0	(15, 30]	2
7	35.0	3.0	(30, 45]	3
8	33.0	3.0	(30, 45]	3
9	33.0	3.0	(30, 45]	3

fcc[['Age','Age_bin_round','Age_bin_range','Age_bin_label']].iloc[1071:1076]

	Age	Age_bin_round	Age_bin_range	Age_bin_label
1071	22.0	2.0	(15, 30]	2
1072	21.0	2.0	(15, 30]	2
1073	40.0	4.0	(30, 45]	3
1074	34.0	3.0	(30, 45]	3
1075	29.0	2.0	(15, 30]	2

4.2 adaptive binning

fix-width binning会导致有些bin会装很多数据，而有些bin只有少数数据甚至为空，依然无法解决数据偏斜的问题， adaptive binning是一种比fix-width更好更安全的方法，可以根据数据本身的分布将数据分到不同的bin中。

方法：

二分位
四分位
十分位

可视化数据

fig,ax = plt.subplots()
fcc['Income'].hist()
ax.set_title('developer income')
ax.set_xlabel('income')
ax.set_ylabel('freqency')

Text(0,0.5,'freqency')

png

quantile_list=[0,0.25,0.5,0.75,1.0]
quat= fcc['Income'].quantile(quantile_list)

quat

00      6000.0
25     20000.0
50     37000.0
75     60000.0
00    200000.0
Name: Income, dtype: float64

# 可视化数据
fig,ax = plt.subplots()
fcc['Income'].hist()
for q in quat:
    plv = plt.axvline(q,color='r')
    ax.legend([plv],['Quantiles'],fontsize=20)

png

q_labels = ['0-25Q','25-50Q','50-75Q','75-100Q']
fcc['Income_q_range'] = pd.qcut(fcc['Income'],q=quantile_list)
fcc['Income_q_label'] = pd.qcut(fcc['Income'],q=quantile_list,labels=q_labels)
fcc[['Income','Income_q_range','Income_q_label']].head()

	Income	Income_q_range	Income_q_label
0	32000.0	(20000.0, 37000.0]	25-50Q
1	15000.0	(5999.999, 20000.0]	0-25Q
2	48000.0	(37000.0, 60000.0]	50-75Q
3	43000.0	(37000.0, 60000.0]	50-75Q
4	6000.0	(5999.999, 20000.0]	0-25Q

5 Statistical Transformations

用于将数值型的特征的分布转化为尽量贴近正态分布(normal distribution)的特征。

5.1 Log Transform

fcc['Income_log'] = np.log(1 + fcc['Income'])
fcc_mean = np.round(np.mean(fcc['Income_log']),2)
fig, ax = plt.subplots()
fcc['Income_log'].hist()
plt.axvline(fcc_mean,color='red')
ax.set_xlabel('Income_log scale')
ax.set_ylabel('frequecy')

Text(0,0.5,'frequecy')

png

As we can see from the above figure, it is nearly close to the normal distribution but we can do much better.

let ‘s see how to do this with box-cox

5.2 box-cox transform

限制条件：

输入数字必须为正数；如果含有负数，使用常量lamda: $\lambda$ 将其转为正数如下： $y = f(x,\lambda) = (x^\lambda - 1) / \lambda$
如果$\lambda = 0 $,则为log transform

income = np.array(fcc['Income'])

income.shape

(15620,)

income_clean = income[~np.isnan(income)]

import scipy.stats as spstats

# 获取最优的lambda
l,opt_lambda = spstats.boxcox(income_clean)
print('optical lambda value is:',opt_lambda)

optical lambda value is: 0.117991226621

fcc['Income_lambda_0'] = spstats.boxcox((1+fcc['Income']),lmbda=0)
fcc['Income_lambda_opt'] = spstats.boxcox(fcc['Income'],lmbda=opt_lambda)

C:\Users\liuwu\Anaconda3\lib\site-packages\scipy\stats\morestats.py:1030: RuntimeWarning: invalid value encountered in less_equal
  if any(x <= 0):

visualization the data

fig,ax = plt.subplots()
plt.axvline(np.round(np.mean(fcc['Income_lambda_opt']),2),color='red')
fcc['Income_lambda_opt'].hist()
ax.set_xlabel('income opt lambda')
ax.set_ylabel('frequency')

Text(0,0.5,'frequency')

png

feature engineering(二) - numeric