Problems with binary one-hot (one-of-K) coding in Python

Views: 512 Posted: 2020/5/24 1:43:40 Tags: python pandas scikit-learn categorical-data

Problem description

Binary one-hot (also known as one-of-K) coding consists of creating one binary column for each distinct value of a categorical variable. For example, if a color column (a categorical variable) takes the values 'red', 'blue', 'yellow', and 'unknown', then a binary one-hot coding replaces the color column with the binary columns 'color=red', 'color=blue', and 'color=yellow'. I start with data in a pandas data frame and I want to use this data to train a model with scikit-learn. I know two ways to do the binary one-hot coding, and neither of them is satisfactory to me.
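To make the idea concrete, here is a minimal sketch of the encoding using `pandas.get_dummies`. The toy data frame and the numeric `size` column are assumptions made up for illustration; `prefix_sep="="` is only there to mimic the 'color=red' naming used above. Note that `get_dummies` also creates a column for 'unknown', which the example in the question leaves out.

```python
import pandas as pd

# Hypothetical toy data; "size" is an invented numeric column.
df = pd.DataFrame(
    {
        "color": ["red", "blue", "yellow", "unknown"],
        "size": [1.0, 2.5, 0.3, 4.1],
    }
)

# One binary column per distinct value of "color".
# prefix_sep="=" produces names such as "color=red".
encoded = pd.get_dummies(df, columns=["color"], prefix_sep="=", dtype=int)

print(encoded)
```

The numeric column is passed through unchanged and the categorical column is replaced by its binary columns.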

Pandas and get_dummies on the categorical columns of the data frame: This method works well as long as the original data frame contains all the available data, that is, as long as you do the one-hot coding before splitting your data into training, validation, and test sets. However, if the data is already split into different sets, this method doesn't work very well. Why? Because one of the data sets (say, the test set) can contain fewer distinct values for a given variable. For example, the training set may contain the values red, blue, yellow, and unknown for the variable color, while the test set contains only red and blue. The test set would then end up with fewer columns than the training set. (I also don't know how the new columns are sorted, so even if both sets have the same columns, they could appear in a different order in each set.)
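The mismatch described above can be reproduced with a few lines. This is only a sketch on invented toy data; the last step, reindexing the test columns against the training columns, is one possible workaround and not something the question itself proposes.

```python
import pandas as pd

# Hypothetical split data: the test set lacks "yellow" and "unknown".
train = pd.DataFrame({"color": ["red", "blue", "yellow", "unknown"]})
test = pd.DataFrame({"color": ["red", "blue"]})

train_enc = pd.get_dummies(train, columns=["color"], prefix_sep="=", dtype=int)
test_enc = pd.get_dummies(test, columns=["color"], prefix_sep="=", dtype=int)

# Encoding each set separately gives a different number of columns.
print(train_enc.shape[1], test_enc.shape[1])

# Possible workaround: force the test set onto the training columns.
# Columns missing from the test set are filled with 0, and the column
# order is taken from the training set.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

Be aware that `reindex` also silently drops any column for a value that appears only in the test set, which may or may not be what you want.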

Sklearn and DictVectorizer: This solves the previous issue, because the same fitted transformation can be applied to the test set. However, the outcome of the transformation is a NumPy array (or a SciPy sparse matrix, which is DictVectorizer's default output) rather than a pandas data frame. To recover the output as a pandas data frame, we need to (or at least this is the way I do it): 1) build pandas.DataFrame(data=outcome of the DictVectorizer transformation, index=index of the original pandas data frame, columns=feature names of the fitted DictVectorizer, from get_feature_names() or, in recent scikit-learn versions, get_feature_names_out()), and 2) join the resulting data frame along the index with the original one containing the numerical columns. This works, but it is somewhat cumbersome.
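The two steps just described might look roughly like the sketch below. The toy frames, the `cat_cols` list, and the helpers `feature_names` and `to_frame` are hypothetical names introduced for illustration. The version check is there because `get_feature_names()` was replaced by `get_feature_names_out()` in newer scikit-learn releases.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical data with a non-default index to show that it is preserved.
train = pd.DataFrame(
    {"color": ["red", "blue", "yellow"], "size": [1.0, 2.5, 0.3]},
    index=[10, 11, 12],
)
test = pd.DataFrame(
    {"color": ["red", "blue"], "size": [4.1, 0.7]},
    index=[20, 21],
)
cat_cols = ["color"]  # columns to one-hot encode

# sparse=False asks for a dense NumPy array instead of a sparse matrix.
vec = DictVectorizer(sparse=False)
train_arr = vec.fit_transform(train[cat_cols].to_dict(orient="records"))
test_arr = vec.transform(test[cat_cols].to_dict(orient="records"))


def feature_names(v):
    """Return the fitted feature names on both old and new scikit-learn."""
    if hasattr(v, "get_feature_names_out"):
        return list(v.get_feature_names_out())
    return list(v.get_feature_names())


def to_frame(arr, original):
    """Step 1: wrap the array in a DataFrame. Step 2: join on the index."""
    dummies = pd.DataFrame(arr, index=original.index, columns=feature_names(vec))
    return original.drop(columns=cat_cols).join(dummies)


train_out = to_frame(train_arr, train)
test_out = to_frame(test_arr, test)
print(test_out)
```

Because the vectorizer is fitted on the training set only, the test frame gets a `color=yellow` column filled with zeros even though yellow never occurs in it.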

Is there a better way to do the binary one-hot coding in a pandas data frame if the data is already split into training and test sets?
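One direction that may be worth trying, offered here only as a sketch and not as an answer taken from the original thread, is to fix the set of categories up front with `pandas.CategoricalDtype`. When a column has a categorical dtype, `get_dummies` creates one column per declared category, whether or not that value occurs in the data. The toy frames below are assumptions, and the categories are taken from the training set.

```python
import pandas as pd

# Hypothetical split data, as in the example from the question.
train = pd.DataFrame({"color": ["red", "blue", "yellow", "unknown"]})
test = pd.DataFrame({"color": ["red", "blue"]})

# Declare the categories once, based on the training set.
color_dtype = pd.CategoricalDtype(categories=sorted(train["color"].unique()))

# Cast both sets to the same categorical dtype before encoding.
train_enc = pd.get_dummies(
    train.astype({"color": color_dtype}), prefix_sep="=", dtype=int
)
test_enc = pd.get_dummies(
    test.astype({"color": color_dtype}), prefix_sep="=", dtype=int
)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Both frames stay ordinary pandas data frames with identical columns in identical order. A value that appears only in the test set becomes missing after the cast and is therefore encoded as a row of zeros.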
