Problem

I have recently been playing with a CTR competition on Kaggle, whose training data contains a large number of categorical features that cannot be fed directly into an LR model. For example, suppose a categorical feature takes the four values [apple, watermelon, pear, peach]. A naive approach maps these categories to [0, 1, 2, 3] and trains on the integers. This is unreasonable: among categories, peach and watermelon carry no ordinal relationship, yet after encoding them as [1, 3] the model sees 3 > 1, so algorithms such as Logistic Regression cannot learn sensible parameters for the feature. The common industry practice is therefore one-hot encoding, i.e. vectorization: with the data above, apple becomes the vector [1, 0, 0, 0] and peach becomes [0, 0, 0, 1]. This can be done with sklearn's OneHotEncoder or with pandas' get_dummies. Obviously the data becomes very sparse, and as the number of categories grows, the feature dimensionality explodes with it, bringing memory and storage problems. In this CTR competition, for instance, one-hot encoding gives a MemoryError even on my 60 GB machine.
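
As a minimal sketch of the one-hot idea on the fruit example (using pandas.get_dummies; recent versions of scikit-learn's OneHotEncoder also accept string categories directly):

import pandas as pd

fruits = pd.Series(['apple', 'watermelon', 'pear', 'peach'])
# One indicator column per category (apple, peach, pear, watermelon),
# a single 1 per row -- mostly zeros, hence the sparsity and memory
# problems once the number of categories grows large.
print(pd.get_dummies(fruits))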

Moreover, in CTR prediction, and in high-dimensional large-scale data in general, datasets are more or less sparse, and mainstream ML libraries all implement sparse matrices to cope with this. Feature hashing, also known as the hashing trick (by analogy to the kernel trick), is a widely used technique in ML. Wikipedia defines it as follows:

In machine learning, feature hashing, also known as the hashing trick[1] (by analogy to the kernel trick), is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array.

Solution

Wikipedia's description of feature hashing is already quite clear, so I will not repeat it here; instead, a few more words about the hashing itself. There are two common implementations, shown below as minimal Python sketches:

import zlib

def hashing_vectorizer(features, N):
    # features: a list of feature strings; N: output dimensionality
    x = [0.0] * N
    for f in features:
        h = zlib.crc32(f.encode())  # any deterministic string hash will do
        x[h % N] += 1               # colliding features simply add up
    return x

And a second, signed variant:

def xi(f):
    # Sign hash xi: an independent hash of f mapped to {+1, -1}.
    # The 'sign:' salt is just an illustrative way to get a hash
    # independent of the index hash above.
    return 1 if zlib.crc32(('sign:' + f).encode()) % 2 == 0 else -1

def hashing_vectorizer_signed(features, N):
    x = [0.0] * N
    for f in features:
        h = zlib.crc32(f.encode())
        idx = h % N
        if xi(f) == 1:
            x[idx] += 1
        else:
            x[idx] -= 1
    return x

Understandably, hashing comes at the price of collisions. The first implementation does nothing about them: the larger N is, the lower the collision probability, but the storage requirement grows accordingly. The second implementation counters collisions with the signed hash ξ: colliding features contribute with independent random signs, so with high probability they cancel and the expected value of the colliding term is 0, keeping the result unbiased. See reference [2] for a detailed treatment.
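
For intuition, force every feature into the same bucket by setting N = 1 (a degenerate case purely for illustration): the unsigned version piles everything up, while the signed version cancels in expectation:

feats = ['f%d' % i for i in range(1000)]
# Unsigned: all 1000 features collide in bucket 0 and accumulate.
print(hashing_vectorizer(feats, 1))         # [1000.0]
# Signed: the +1s and -1s largely cancel; the expected value is 0
# (the actual sum fluctuates on the order of sqrt(1000)).
print(hashing_vectorizer_signed(feats, 1))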

The sklearn FeatureHasher implementation

class sklearn.feature_extraction.FeatureHasher(n_features=1048576, input_type='dict', dtype=<type 'numpy.float64'>, non_negative=False); its transform method returns a scipy.sparse matrix (CSR).

The hash function employed is the signed 32-bit version of Murmurhash3.

Pay attention to the input format: the interface supports three types, pair, dict, and string, selected via input_type; the sign of the 32-bit hash value determines the sign of the stored value, playing the role of ξ in the second implementation above. See the official featurehasher tests for examples.
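
A quick sketch of the three input formats (the fruit features are made-up illustrations; note how the dict path internally combines a name and a string value into 'fruit=apple'):

from sklearn.feature_extraction import FeatureHasher

# dict input: one {feature_name: value} mapping per sample;
# string values are hashed as 'name=value' with an implicit 1
h = FeatureHasher(n_features=16, input_type='dict')
print(h.transform([{'fruit': 'apple'}, {'fruit': 'peach'}]).nonzero())

# pair input: (feature_name, value) tuples per sample
h = FeatureHasher(n_features=16, input_type='pair')
print(h.transform([[('fruit=apple', 1)], [('fruit=peach', 1)]]).nonzero())

# string input: each string is itself a feature with value 1
h = FeatureHasher(n_features=16, input_type='string')
print(h.transform([['fruit=apple'], ['fruit=peach']]).nonzero())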

There is also a good example on Stack Overflow:

Q:

I am using FeatureHasher in scikit-learn.

Can anyone explain why I end up with 4 non-zero entries in the sparse matrix instead of 2 after the following:

>>> f = FeatureHasher(input_type='string')
>>> f.transform(('as','bs'))
<2x1048576 sparse matrix of type '<type 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
>>> g=f.transform(('as','bs'))
>>> g.data
array([-1.,  1., -1., -1.])
>>> g.nonzero()
(array([0, 0, 1, 1], dtype=int32), array([341263, 354738,  98813, 341263], dtype=int32))

A:

It appears FeatureHasher is expecting a sequence of sequences: the outer sequence holds the observations, and the inner ones hold the features. With your input, the inner sequences are the characters of the strings.

Observation 0: 'a' -> 354738, 's' -> 341263

Observation 1: 'b' -> 98813, 's' -> 341263

Try this:

g = f.transform([['as'],['bs']])
For output:

>>> g.nonzero()
(array([0, 1], dtype=int32), array([494108, 335425], dtype=int32))
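
As a follow-up, the corrected call can be checked end to end; each observation now contributes exactly one ±1 entry, whose sign comes from the signed MurmurHash3 value:

from sklearn.feature_extraction import FeatureHasher

f = FeatureHasher(input_type='string')
g = f.transform([['as'], ['bs']])
print(g.data)       # one signed +/-1 entry per observation
print(g.nonzero())  # rows [0, 1], one hashed column index each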

References

[1] Feature hashing, Wikipedia, the free encyclopedia.

[2] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola and Josh Attenberg (2009). Feature hashing for large scale multitask learning. Proc. ICML.

[3] scikit-learn documentation: feature hashing.