Appending new column to dask dataframe
Question
This is a follow-up question to Shuffling data in dask.
I have an existing dask dataframe df on which I wish to do the following:
df['rand_index'] = np.random.permutation(len(df))
However, this gives the error Column assignment doesn't support type ndarray. I tried df.assign(rand_index=np.random.permutation(len(df))), which gives the same error.
Here is a minimal non-working sample:
import pandas as pd
import dask.dataframe as dd
import numpy as np
df = dd.from_pandas(pd.DataFrame({'A':[1,2,3]*10, 'B':[3,2,1]*10}), npartitions=10)
df['rand_index'] = np.random.permutation(len(df))
Note:
The previous question mentioned using df = df.map_partitions(add_random_column_to_pandas_dataframe, ...)
but I'm not sure if that is relevant to this particular case.
Edit 1
I attempted
df['rand_index'] = dd.from_array(np.random.permutation(len(df)))
which executed without issue. When I inspected df.head() it seemed that the new column was created just fine. However, when I look at df.tail(), rand_index is a bunch of NaNs.
In fact, just to confirm, I checked df.rand_index.max().compute(), which turned out to be smaller than len(df) - 1. So this is probably where df.map_partitions comes into play, as I suspect this is an issue with how dask partitions the data. In my particular case I have 80 partitions (not referring to the sample case).
Answer
You would need to turn np.random.permutation(len(df)) into a type that dask understands:
permutations = dd.from_array(np.random.permutation(len(df)))
df['rand_index'] = permutations
df
This would yield:
Dask DataFrame Structure:
                    A      B rand_index
npartitions=10
0               int64  int64      int32
3                 ...    ...        ...
...               ...    ...        ...
27                ...    ...        ...
29                ...    ...        ...
Dask Name: assign, 61 tasks
It is now up to you whether to .compute() the actual results.