
What Is Feature Hashing (hashing-trick)?

I know feature hashing (the hashing trick) is used to reduce dimensionality and handle the sparsity of bit vectors, but I don't understand how it really works. Can anyone explain this to me?

Solution 1:

In Pandas, you could use something like this:

import pandas as pd
import numpy as np

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

data = pd.DataFrame(data)

def hash_col(df, col, N):
    # Expand `col` into N indicator columns, picked by hash(value) % N.
    cols = [col + "_" + str(i) for i in range(N)]
    def xform(x):
        tmp = [0] * N
        tmp[hash(x) % N] = 1   # note: str hash() is salted per process
        return pd.Series(tmp, index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col, axis=1)

print(hash_col(data, 'state', 4))

The output would be

   pop  year  state_0  state_1  state_2  state_3
0  1.5  2000        0        1        0        0
1  1.7  2001        0        1        0        0
2  3.6  2002        0        1        0        0
3  2.4  2001        0        0        0        1
4  2.9  2002        0        0        0        1

Also, at the Series level, you could do:

import numpy as np
import pandas as pd

def hash_col(s, col, N):
    # Per-Series version: `col` is an index label of the Series `s`.
    s = s.replace('', np.nan)          # blank strings become NaN
    cols = [col + "_" + str(i) for i in range(N)]
    tmp = [0] * N
    tmp[hash(s[col]) % N] = 1          # built-in hash() is salted per process
    res = pd.concat([s, pd.Series(tmp, index=cols)])
    return res.drop(col)

a = pd.Series(['new york', 30, ''], index=['city', 'age', 'test'])
b = pd.Series(['boston', 30, ''], index=['city', 'age', 'test'])

print(hash_col(a, 'city', 10))
print(hash_col(b, 'city', 10))

This works on a single Series; the column name is assumed to be an index label of the Series. It also replaces blank strings with NaN and lets Pandas coerce the dtypes accordingly.

age        30
test      NaN
city_0      0
city_1      0
city_2      0
city_3      0
city_4      0
city_5      0
city_6      0
city_7      1
city_8      0
city_9      0
dtype: object
age        30
test      NaN
city_0      0
city_1      0
city_2      0
city_3      0
city_4      0
city_5      1
city_6      0
city_7      0
city_8      0
city_9      0
dtype: object

If, however, there is a vocabulary, and you simply want to one-hot-encode, you could use

import numpy as np
import pandas as pd
import scipy.sparse as sps

def hash_col(df, col, vocab):
    # One-hot encode `col` against a fixed, known vocabulary.
    cols = [col + "=" + str(v) for v in vocab]
    def xform(x):
        tmp = [0] * len(vocab)
        tmp[vocab.index(x)] = 1
        return pd.Series(tmp, index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col, axis=1)

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

df = pd.DataFrame(data)

df2 = hash_col(df, 'state', ['Ohio','Nevada'])

print(df2)
sparse = sps.csr_matrix(df2.values)  # sparsified final frame

which will give

   pop  year  state=Ohio  state=Nevada
0  1.5  2000           1             0
1  1.7  2001           1             0
2  3.6  2002           1             0
3  2.4  2001           0             1
4  2.9  2002           0             1

I also added sparsification of the final dataframe. In an incremental setting, where we might not have encountered all values beforehand (but somehow obtained the list of all possible values), the approach above can be used. Incremental ML methods need the same number of features at each increment, hence one-hot encoding must produce the same number of columns for each batch.
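
By contrast, the hashing trick needs no vocabulary at all and still yields a fixed width. A minimal sketch reusing the same hash_col idea on hypothetical batches (note that Python's built-in hash() is salted per process for strings unless PYTHONHASHSEED is set, so the indices are only stable within one run):

import pandas as pd

def hash_col(df, col, N):
    # Same hashing encoder as in the first snippet above.
    cols = [col + "_" + str(i) for i in range(N)]
    def xform(x):
        tmp = [0] * N
        tmp[hash(x) % N] = 1
        return pd.Series(tmp, index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col, axis=1)

# Two batches containing different, previously unseen values still
# produce the same feature columns -- no vocabulary required.
batch1 = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
batch2 = pd.DataFrame({'state': ['Texas', 'Utah'], 'pop': [3.0, 1.1]})

print(hash_col(batch1, 'state', 4).columns.tolist())
print(hash_col(batch2, 'state', 4).columns.tolist())
# both: ['pop', 'state_0', 'state_1', 'state_2', 'state_3']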

Solution 2:

Here (sorry, I cannot add this as a comment for some reason). Also, the first page of Feature Hashing for Large Scale Multitask Learning explains it nicely.
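
Not part of the original answers, but as a concrete starting point: a minimal sketch using scikit-learn's FeatureHasher, which implements the signed hashing scheme from that paper (assuming scikit-learn is installed):

from sklearn.feature_extraction import FeatureHasher

# Hash token lists into fixed-width vectors of 8 features each.
# Entries are signed (+1/-1) so that collisions tend to cancel
# in expectation, as described in the paper.
hasher = FeatureHasher(n_features=8, input_type='string')
X = hasher.transform([['the', 'quick', 'brown', 'fox'],
                      ['the', 'lazy', 'dog']])
print(X.toarray())  # 2 x 8 dense view of the sparse result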

Solution 3:

Large sparse features can arise from interactions: with U as users and X as emails, the dimensionality of U x X is memory intensive. A task like spam filtering usually has time constraints as well.

The hashing trick, like other hash functions, stores binary indices rather than the raw features, which makes large-scale training feasible. In theory, the longer the hash, the greater the performance gain, as illustrated in the original paper.

It allocates the original features into different buckets (a feature space of finite length) so that their semantics are preserved. Even when spammers use typos to stay off the radar, the hashed forms remain close, despite some distortion error.

For example,

"the quick brown fox" transform to:

h(the) mod 5 = 0
h(quick) mod 5 = 1
h(brown) mod 5 = 1
h(fox) mod 5 = 3

Using the indices rather than the text values saves space.
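
A rough sketch of this bucketing in Python (the concrete indices will differ from the illustrative ones above; md5 is used instead of the built-in hash() because the latter is salted per process):

import hashlib

def bucket(token, N=5):
    # Stable hash of a token into one of N buckets.
    return int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16) % N

vec = [0] * 5
for token in "the quick brown fox".split():
    vec[bucket(token)] += 1  # colliding tokens simply add up

print(vec)  # a 5-dimensional bag of words, whatever the vocabulary size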

To summarize some of the applications:

  • dimensionality reduction for high-dimensional feature vectors

    • text in email classification tasks, collaborative filtering on spam
  • sparsification

  • bag-of-words on the fly

  • cross-product features (see the sketch after this list)

  • multi-task learning
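
To make the cross-product case concrete: for the U x X interaction above, hashing the concatenated pair keeps the feature space at a fixed size N instead of |U| x |X|. A hypothetical sketch (the user id, tokens, and separator are invented for illustration):

import hashlib

N = 2 ** 10  # fixed feature space, however many users/tokens exist

def idx(feature):
    # Stable hash of a feature string into one of N buckets.
    return int(hashlib.md5(feature.encode('utf-8')).hexdigest(), 16) % N

def cross_features(user, tokens):
    vec = [0] * N
    for t in tokens:
        vec[idx(user + "^" + t)] += 1  # user x token cross feature
    return vec

v = cross_features("user123", "cheap pills now".split())
print(sum(v), len(v))  # 3 counts spread over a 1024-dim vector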

Reference:

  • Original papers:

    1. Feature Hashing for Large Scale Multitask Learning

    2. Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A., Strehl, A., & Vishwanathan, V. (2009). Hash kernels

  • What is the hashing trick

  • Quora

  • Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing

Implementation:

  • Langford, J., Li, L., & Strehl, A. (2007). Vowpal Wabbit online learning project (Technical Report). http://hunch.net/?p=309
