Problems with binary one-hot (one-of-K) coding in Python


Question


Binary one-hot (also known as one-of-K) coding consists of creating one binary column for each distinct value of a categorical variable. For example, if a color column (a categorical variable) takes the values 'red', 'blue', 'yellow', and 'unknown', then binary one-hot coding replaces the color column with binary columns such as 'color=red', 'color=blue', and 'color=yellow'. I start with data in a pandas DataFrame and I want to use this data to train a model with scikit-learn. I know two ways to do the binary one-hot coding, but neither of them is satisfactory to me.


  1. Pandas and get_dummies on the categorical columns of the DataFrame. This method works well as long as the original DataFrame contains all the available data, that is, as long as you do the one-hot coding before splitting your data into training, validation, and test sets. However, if the data is already split into different sets, this method doesn't work very well. Why? Because one of the data sets (say, the test set) can contain fewer distinct values for a given variable. For example, the training set may contain the values red, blue, yellow, and unknown for the variable color, while the test set contains only red and blue. The test set would then end up with fewer columns than the training set. (I also don't know how the new columns are sorted, so even if both sets have the same columns, they could be in a different order in each set.)


  2. Sklearn and DictVectorizer. This solves the previous issue, because we can make sure that we apply exactly the same transformation to the test set. However, the outcome of the transformation is a numpy array instead of a pandas DataFrame. To recover the output as a pandas DataFrame, we need to (or at least this is the way I do it): 1) build pandas.DataFrame(data=outcome of the DictVectorizer transformation, index=index of the original pandas DataFrame, columns=the fitted DictVectorizer's feature names, from get_feature_names_out() in recent scikit-learn or get_feature_names() in older versions), and 2) join the resulting DataFrame along the index with the original one containing the numerical columns. This works, but it is somewhat cumbersome.

Is there a better way to do binary one-hot coding in a pandas DataFrame if the data is already split into training and test sets?

Recommended answer


If your columns are in the same order, you can concatenate the dfs, use get_dummies, and then split them back again, e.g.,

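import pandas as pd
# Encode train and test together so both get the same dummy columns.
encoded = pd.get_dummies(pd.concat([train, test], axis=0))
# Split back by position: the first train_rows rows are the training set.
train_rows = train.shape[0]
train_encoded = encoded.iloc[:train_rows, :]
test_encoded = encoded.iloc[train_rows:, :]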


If your columns are not in the same order, then you'll have challenges regardless of what method you try.
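The concatenate-then-split approach can be tried end to end on toy data. This is a minimal sketch, assuming two hypothetical frames `train` and `test` with the same columns in the same order, where the test split lacks the value 'yellow'; the data and the `size` column are made up for illustration.

```python
import pandas as pd

# Hypothetical toy splits sharing the same columns; test lacks 'yellow'.
train = pd.DataFrame({"color": ["red", "blue", "yellow"], "size": [1, 2, 3]})
test = pd.DataFrame({"color": ["red", "blue"], "size": [4, 5]})

# Encode both splits together, then split back by row position.
encoded = pd.get_dummies(pd.concat([train, test], axis=0))
train_rows = train.shape[0]
train_encoded = encoded.iloc[:train_rows, :]
test_encoded = encoded.iloc[train_rows:, :]

# Both splits now expose the same dummy columns, in the same order.
print(list(train_encoded.columns))
print(list(test_encoded.columns))
```

Because the dummy columns are derived from the combined data, the test split still gets a column for 'yellow' even though that value never appears in it.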
