Appending new column to dask dataframe


Question



This is a follow up question to Shuffling data in dask.

I have an existing dask dataframe df where I wish to do the following:

df['rand_index'] = np.random.permutation(len(df))

However, this gives the error: Column assignment doesn't support type ndarray. I tried to use df.assign(rand_index=np.random.permutation(len(df))), which gives the same error.

Here is a minimal (not) working sample:

import pandas as pd
import dask.dataframe as dd
import numpy as np

df = dd.from_pandas(pd.DataFrame({'A':[1,2,3]*10, 'B':[3,2,1]*10}), npartitions=10)
df['rand_index'] = np.random.permutation(len(df))

Note:

The previous question mentioned using df = df.map_partitions(add_random_column_to_pandas_dataframe, ...) but I'm not sure if that is relevant to this particular case.

Edit 1

I attempted df['rand_index'] = dd.from_array(np.random.permutation(len(df))), which executed without an issue. When I inspected df.head(), it seemed that the new column was created just fine. However, when I look at df.tail(), rand_index is a bunch of NaNs.

In fact, just to confirm, I checked df.rand_index.max().compute(), which turned out to be smaller than len(df)-1. So this is probably where df.map_partitions comes into play, as I suspect this is an issue with how dask partitions the data. In my particular case I have 80 partitions (not referring to the sample case).

Solution

You would need to turn np.random.permutation(len(df)) into a type that dask understands:

permutations = dd.from_array(np.random.permutation(len(df)))
df['rand_index'] = permutations
df

This would yield:

Dask DataFrame Structure:
                    A      B rand_index
npartitions=10                         
0               int64  int64      int32
3                 ...    ...        ...
...               ...    ...        ...
27                ...    ...        ...
29                ...    ...        ...
Dask Name: assign, 61 tasks

It is now up to you whether to call .compute() to calculate the actual results.
