Adding pandas columns to a sparse matrix

Question

I have additional derived values for my X variables that I want to use in my model.

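from sklearn.model_selection import train_test_split

# pd_data is the source DataFrame holding the text and the derived numeric columns
XAll = pd_data[['title', 'wordcount', 'sumscores', 'length']]
y = pd_data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(XAll, y, random_state=1)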

Since the title column holds text data, I first convert it to a document-term matrix (dtm) separately:

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(max_df=0.5)
vect.fit(X_train['title'])
X_train_dtm = vect.transform(X_train['title'])
column_index = X_train_dtm.indices

print(type(X_train_dtm))    # This is <class 'scipy.sparse.csr.csr_matrix'>
print("X_train_dtm shape",X_train_dtm.get_shape())  # This is (856, 2016)
print("column index:",column_index)     # This is column index: [ 533  754  859 ...,  633  950 1339]

Now that I have the text as a document-term matrix, I would like to add the other numeric features ('wordcount', 'sumscores', 'length') to X_train_dtm. I will then build the model on the new dtm, which I expect to be more accurate because it includes the additional features.

How do I add additional numeric columns of the pandas DataFrame to a sparse CSR matrix?

Recommended answer

Found the solution. We can do this using scipy.sparse.hstack:

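import numpy as np
from scipy.sparse import hstack
# Append 'wordcount' as one extra column; [:, None] reshapes it to (n_rows, 1)
X_train_dtm = hstack((X_train_dtm, np.array(X_train['wordcount'])[:, None]))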
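
The answer shows a single column, while the question asks about three. As a minimal sketch, the same hstack call can append all numeric columns at once. The toy DataFrame and the small matrix below are hypothetical stand-ins for the question's X_train and X_train_dtm, and the names numeric_cols, numeric_block, and X_train_full are made up for illustration. Note that hstack returns a COO matrix unless a format is requested, so format="csr" is passed here to keep the CSR type from the question.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for the question's X_train (numeric columns only).
X_train = pd.DataFrame({
    "wordcount": [3, 5, 2],
    "sumscores": [0.5, -1.0, 2.0],
    "length": [14, 30, 9],
})

# Hypothetical 3 x 4 matrix playing the role of X_train_dtm
# (in the question this comes from vect.transform(X_train['title'])).
X_train_dtm = sparse.csr_matrix(np.array([
    [1, 0, 0, 2],
    [0, 1, 0, 0],
    [0, 0, 3, 0],
]))

numeric_cols = ["wordcount", "sumscores", "length"]

# Convert the selected columns to a (n_rows, 3) array and wrap it as sparse.
numeric_block = sparse.csr_matrix(X_train[numeric_cols].to_numpy(dtype=float))

# Stack column-wise; format="csr" avoids the default COO result.
X_train_full = sparse.hstack([X_train_dtm, numeric_block], format="csr")

print(X_train_full.shape)
```

The test split would need the same treatment: transform X_test['title'] with the already fitted vectorizer, then stack the same numeric columns in the same order so that train and test matrices have matching columns.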
