Functional API for data.table aggregation which allows capture of associated aggregate calls so they can be recomputed later.

Usage

aggregate2(
  x,
  by,
  ...,
  nthread = 1,
  progress = interactive(),
  BPPARAM = NULL,
  enlist = TRUE,
  moreArgs = list()
)

Arguments

x

data.table

by

character One or more valid column names in x used to define the groups.

...

call One or more aggregations to compute for each group defined by by in x. If you name an aggregation call, that name becomes the column name of its value in the resulting data.table. Otherwise, a default name is built from the function name and its first argument, which is assumed to be the name of the column being aggregated over.

nthread

numeric(1) Number of threads to use for split-apply-combine parallelization. Uses BiocParallel::bplapply if nthread > 1 or if you pass in BPPARAM. Does not modify data.table threads, so be sure to use setDTthreads for reasonable nested parallelism. See details for performance considerations.

progress

logical(1) Display a progress bar for parallelized computations? Only works if bpprogressbar<- is defined for the current BiocParallel back-end.

BPPARAM

BiocParallelParam object. Use to customize the parallelization back-end of bplapply. Note that nthread overrides any settings from BPPARAM as long as bpworkers<- is defined for that class.

enlist

logical(1) Default is TRUE. Set to FALSE to evaluate the first call in ... within data.table groups. See details for more information.

moreArgs

list() A named list where each item is an argument to one of the calls in ... which is not a column in the table being aggregated. Use to further parameterize your calls. Please note that these are not added to your aggregate calls unless you specify the names in the call.

Value

data.table of aggregation results.

Details

Use of Non-Standard Evaluation

Arguments in ... are substituted and wrapped in a list, which is passed through to the j argument of [.data.table internally. The function currently tries to build informative column names for unnamed arguments in ... by combining the name of each called function with the name of its first argument, which is assumed to be the column name being aggregated over. If an argument to ... is named, that name will be the column name of its value in the resulting data.table.
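The capture-and-wrap mechanism described above can be sketched with data.table alone. This is a minimal illustration, not the package's implementation: sketch_aggregate is a hypothetical helper, and the "_" separator used for inferred names is an assumption.

```r
library(data.table)

# Hypothetical helper (not part of the documented API): captures the calls in
# `...` unevaluated, infers names for unnamed calls, and evaluates them in j.
sketch_aggregate <- function(x, by, ...) {
  # Capture the calls in ... without evaluating them
  agg_calls <- as.list(substitute(list(...)))[-1L]
  # Infer a name from the function name and its first argument,
  # e.g. mean(y) becomes "mean_y" (the separator is an assumption)
  inferred <- vapply(
    agg_calls,
    function(cl) paste(as.character(cl[[1L]]), as.character(cl[[2L]]), sep = "_"),
    character(1)
  )
  nms <- names(agg_calls)
  if (is.null(nms)) nms <- rep("", length(agg_calls))
  # Names supplied by the caller take precedence over inferred ones
  names(agg_calls) <- ifelse(nms == "", inferred, nms)
  # Wrap the calls in a single list() call and pass it through to j
  j_call <- as.call(c(as.name("list"), agg_calls))
  x[, eval(j_call), by = by]
}

dt <- data.table(g = c("a", "a", "b"), y = c(1, 2, 3))
res <- sketch_aggregate(dt, by = "g", mean(y), total = sum(y))
print(res)
```

Here the unnamed call mean(y) receives an inferred column name, while the named call keeps the name total.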

Enlisting

The primary use case for enlist=FALSE is to allow computation of dependent aggregations, where the output from a previous aggregation is required in a subsequent one. For this case, wrap your call in { and assign intermediate results to variables, then return the final results as a list, where each list item becomes a column in the final table with the corresponding name. Name inference is disabled for this case, since it is assumed you will name the returned list items appropriately.

A major advantage over multiple calls to aggregate is that the overhead of parallelization is paid only once, even for complex multi-step computations like fitting a model, capturing its parameters, and making predictions using it. It also allows capturing arbitrarily complex calls which can be recomputed later using the update,TreatmentResponseExperiment-method. A potential disadvantage is increased RAM usage per thread due to storing intermediate values in variables, as well as any memory allocation overhead associated therewith.
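The shape of such a dependent aggregation can be sketched directly in the j argument of [.data.table. This is an illustration under assumptions, not the package's own code: the table dt and the columns g, x and y are made up for the example.

```r
library(data.table)

# Made-up example data: two groups, each with an exact linear relationship
dt <- data.table(
  g = rep(c("a", "b"), each = 3),
  x = c(1, 2, 3, 1, 2, 3),
  y = c(2, 4, 6, 1, 3, 5)
)

res <- dt[, {
  # Intermediate results are assigned to variables within each group
  fit <- lm(y ~ x)
  intercept <- unname(coef(fit)[1])
  slope <- unname(coef(fit)[2])
  # The final list is named explicitly; each item becomes a column
  list(
    slope = slope,
    intercept = intercept,
    pred_at_4 = intercept + slope * 4
  )
}, by = "g"]
print(res)
```

With aggregate2, the same braced expression would presumably be supplied as the first call in ... together with enlist=FALSE, so that the model is fit once per group and both its parameters and a prediction are returned in a single pass.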

See also

data.table::[.data.table, BiocParallel::bplapply