Appending new column to dask dataframe
Question
This is a follow-up question to Shuffling data in dask.
I have an existing dask dataframe df on which I wish to do the following:
df['rand_index'] = np.random.permutation(len(df))
However, this gives the error Column assignment doesn't support type ndarray. I tried df.assign(rand_index=np.random.permutation(len(df))), which gives the same error.
Here is a minimal non-working sample:
import pandas as pd
import dask.dataframe as dd
import numpy as np
df = dd.from_pandas(pd.DataFrame({'A':[1,2,3]*10, 'B':[3,2,1]*10}), npartitions=10)
df['rand_index'] = np.random.permutation(len(df))
Note:
The previous question mentioned using df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
but I'm not sure if that is relevant to this particular case.
Edit 1
I attempted
df['rand_index'] = dd.from_array(np.random.permutation(len(df)))
which executed without issue. When I inspected df.head() it seemed that the new column was created just fine. However, when I look at df.tail(), rand_index is a bunch of NaNs.
In fact, just to confirm, I checked df.rand_index.max().compute(), which turned out to be smaller than len(df) - 1. So this is probably where df.map_partitions comes into play, as I suspect this is an issue with how dask partitions the data. In my particular case I have 80 partitions (not referring to the sample case).
Answer
You would need to turn np.random.permutation(len(df)) into a type that dask understands:
permutations = dd.from_array(np.random.permutation(len(df)))
df['rand_index'] = permutations
df
This would yield:
Dask DataFrame Structure:
                    A      B rand_index
npartitions=10
0               int64  int64      int32
3                 ...    ...        ...
...               ...    ...        ...
27                ...    ...        ...
29                ...    ...        ...
Dask Name: assign, 61 tasks
It is now up to you whether to .compute() the actual results.