dplyr在行的子集上mutate / replace [英] dplyr mutate/replace on a subset of rows

查看:956
本文介绍了dplyr在行的子集上mutate / replace的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试一个基于dplyr的工作流程(而不是主要使用data.table,我习惯),我遇到了一个问题,我找不到一个等价dplyr解决方案。我通常遇到的情况,我需要有条件地更新/替换几个列基于单个条件。下面是一些使用我的data.table解决方案的示例代码:

I'm in the process of trying out a dplyr-based workflow (rather than using mostly data.table, which I'm used to), and I've come across a problem that I can't find an equivalent dplyr solution to. I commonly run into the scenario where I need to conditionally update/replace several columns based on a single condition. Here's some example code, with my data.table solution:

library(data.table)

# Create some sample data
set.seed(1)
dt <- data.table(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

# Replace the values of several columns for rows where measure is "exit"
dt <- dt[measure == 'exit', 
         `:=`(qty.exit = qty,
              cf = 0,
              delta.watts = 13)]

有一个简单的dplyr解决这个相同的问题吗?我想避免使用ifelse,因为我不想多次键入条件 - 这是一个简化的示例,但有时基于单个条件有很多分配。

Is there a simple dplyr solution to this same problem? I'd like to avoid using ifelse because I don't want to have to type the condition multiple times - this is a simplified example, but there are sometimes many assignments based on a single condition.

感谢您的帮助!

推荐答案

1)保持管道,(2)不要覆盖输入和(3)只要求指定一次条件:

These solutions (1) maintain the pipeline, (2) do not overwrite the input and (3) only require that the condition be specified once:

1a)mutate_cond 为可以合并到流水线​​中的数据框架或数据表创建一个简单的函数。此函数类似于 mutate ,但仅作用于满足条件的行:

1a) mutate_cond Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition:

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)

1b)mutate_last 这是数据框架或数据表的替代函数,它也是 mutate ,但仅在 group_by (如下面的示例),并且仅对最后一个组而不是每个组进行操作。注意,TRUE> FALSE,所以如果 group_by 指定一个条件,则 mutate_last p>

1b) mutate_last This is an alternative function for data frames or data tables which again is like mutate but is only used within group_by (as in the example below) and only operates on the last group rather than every group. Note that TRUE > FALSE so if group_by specifies a condition then mutate_last will only operate on rows satisfying that condition.

mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  indices <- attr(.data, "indices")[[n]] + 1
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}


DF %>% 
   group_by(is.exit = measure == 'exit') %>%
   mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
   ungroup() %>%
   select(-is.exit)

2)失效条件后来被删除。然后使用 ifelse 替换或带逻辑的算术,如图所示。这也适用于数据表。

2) factor out condition Factor out the condition by making it an extra column which is later removed. Then use ifelse, replace or arithmetic with logicals as illustrated. This also works for data tables.

library(dplyr)

DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
       select(-is.exit)

3)sqldf 我们可以通过sqldf包在流水线中使用SQL update (但不是数据表,除非我们转换它们 - 这可能表示dplyr中的错误。请参见 dplyr问题1579 )。由于更新的存在,但是实际上更新,我们可能看起来在此代码中修改了输入

3) sqldf We could use SQL update via the sqldf package in the pipeline for data frames (but not data tables unless we convert them -- this may represent a bug in dplyr. See dplyr issue 1579). It may seem that we are undesirably modifying the input in this code due to the existence of the update but in fact the update is acting on a copy of the input in the temporarily generated database and not on the actual input.

library(sqldf)

DF %>% 
   do(sqldf(c("update '.' 
                 set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13 
                 where measure = 'exit'", 
              "select * from '.'")))

注意1:我们使用 DF

set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace=T),
                 space = sample(1:4, 50, replace=T),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50, 
                               replace=T),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace=T),
                 cf = runif(50))

注意2:以便轻松指定更新行的子集也在dplyr问题中讨论了 134 631 1518 1573 631 是主线程, 1573 正在审查这里的答案。

Note 2: The problem of how to easily specify updating a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573 with 631 being the main thread and 1573 being a review of the answers here.

这篇关于dplyr在行的子集上mutate / replace的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆