Functional API for data.table aggregation that captures the associated aggregate calls so they can be recomputed later.
aggregate2.Rd
Usage
aggregate2(
  x,
  by,
  ...,
  nthread = 1,
  progress = interactive(),
  BPPARAM = NULL,
  enlist = TRUE,
  moreArgs = list()
)
Arguments
- x
data.table
- by
character
One or more valid column names in x, used to define the groups.
- ...
call
One or more aggregations to compute for each group of by in x. If you name an aggregation call, that name becomes the column name of its value in the resulting data.table; otherwise a default name is built from the function name and its first argument, which is assumed to be the name of the column being aggregated over.
- nthread
numeric(1)
Number of threads to use for split-apply-combine parallelization. Uses BiocParallel::bplapply if nthread > 1 or if you pass in BPPARAM. Does not modify data.table threads, so be sure to use setDTthreads for reasonable nested parallelism. See details for performance considerations.
- progress
logical(1)
Display a progress bar for parallelized computations? Only works if bpprogressbar<- is defined for the current BiocParallel back-end.
- BPPARAM
BiocParallelParam object
Use to customize the parallelization back-end of bplapply. Note that nthread overrides any settings from BPPARAM as long as bpworkers<- is defined for that class.
- enlist
logical(1)
Default is TRUE. Set to FALSE to evaluate the first call in ... within data.table groups. See details for more information.
- moreArgs
list()
A named list where each item is an argument to one of the calls in ... which is not a column in the table being aggregated. Use to further parameterize your calls. Please note that these are not added to your aggregate calls unless you specify the names in the call.
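The arguments above can be exercised with a small call such as the one below. This is a minimal sketch, not taken from the package's own examples: it assumes aggregate2 is exported by the CoreGx package, and the table dt with its columns drug, dose and viability is made up for illustration.

```r
library(data.table)
library(CoreGx)  # assumption: the package that exports aggregate2()

# Hypothetical toy data: two drugs measured at three doses each
dt <- data.table(
    drug = rep(c("A", "B"), each = 3),
    dose = rep(c(0.1, 1, 10), times = 2),
    viability = c(95, 70, 30, 99, 90, 85)
)

res <- aggregate2(
    dt,
    by = "drug",
    mean(viability),        # unnamed call: the column name is inferred
    max_dose = max(dose),   # named call: the column is called max_dose
    nthread = 1,            # single thread, so BiocParallel is not used
    progress = FALSE
)
print(res)
```

The result should contain one row per drug, with the named aggregation stored in a max_dose column and the unnamed one under a name derived from the call.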
Details
Use of Non-Standard Evaluation
Arguments in ... are substituted and wrapped in a list, which is passed
through to the j argument of [.data.table internally. The function currently
tries to build informative column names for unnamed arguments in ... by
combining the name of each function call with the name of its first argument,
which is assumed to be the column name being aggregated over. If an argument
to ... is named, that will be the column name of its value in the resulting
data.table.
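The capture-and-name step described above can be illustrated in base R. The helper below, capture_aggregations, is hypothetical and is not the package's internal code. In particular, joining the function name and first argument with an underscore is an assumption made for this sketch.

```r
# Hypothetical helper: captures the calls in `...` without evaluating them,
# gives unnamed calls a default name, and wraps everything in a list() call.
capture_aggregations <- function(...) {
    calls <- as.list(substitute(list(...)))[-1]
    nms <- names(calls)
    if (is.null(nms)) nms <- rep("", length(calls))
    for (i in seq_along(calls)) {
        unnamed <- !nzchar(nms[i])
        if (unnamed && is.call(calls[[i]]) && length(calls[[i]]) > 1) {
            fun <- deparse(calls[[i]][[1]])[1]
            arg <- deparse(calls[[i]][[2]])[1]
            # Assumed naming scheme: <function>_<first argument>
            nms[i] <- paste(fun, arg, sep = "_")
        }
    }
    names(calls) <- nms
    as.call(c(as.name("list"), calls))
}

# Nothing is evaluated here; viability and dose need not exist yet
j_expr <- capture_aggregations(mean(viability), max_dose = max(dose))
print(j_expr)

# The captured call can be evaluated later against any matching data
result <- eval(j_expr, list(viability = c(1, 3), dose = c(0.1, 10)))
print(result)
```

Because the call is stored unevaluated, it can be re-run later against updated data, which is the property that makes recomputation possible.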
Enlisting
The primary use case for enlist=FALSE
is to allow computation of dependent
aggregations, where the output from a previous aggregation is required in a
subsequent one. For this case, wrap your call in {
and assign intermediate
results to variables, returning the final results as a list where each list
item will become a column in the final table with the corresponding name.
Name inference is disabled for this case, since it is assumed you will name
the returned list items appropriately.
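The shape of such a dependent aggregation can be shown with plain data.table, since the call is ultimately evaluated within data.table groups. This is a sketch under assumptions: the toy table and its columns are made up, and the block is passed directly to j here. With aggregate2, the same { block would presumably be supplied as the first argument in ... together with enlist=FALSE.

```r
library(data.table)

# Hypothetical toy data: two drugs measured at four doses each
dt <- data.table(
    drug = rep(c("A", "B"), each = 4),
    dose = rep(c(0.01, 0.1, 1, 10), times = 2),
    viability = c(98, 90, 60, 20, 100, 97, 92, 88)
)

res <- dt[, {
    # Intermediate result: a model fit that later steps depend on
    fit <- lm(viability ~ log10(dose))
    cf <- coef(fit)
    # Each named list item becomes a column in the result
    list(
        intercept = cf[["(Intercept)"]],
        slope = cf[["log10(dose)"]],
        pred_at_1 = unname(predict(fit, newdata = data.frame(dose = 1)))
    )
}, by = "drug"]
print(res)
```

The model is fitted once per group and reused for both the coefficients and the prediction, rather than being refitted in separate aggregation calls.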
A major advantage over multiple calls to aggregate is that
the overhead of parallelization is paid only once even for complex multi-step
computations like fitting a model, capturing its parameters, and making
predictions using it. It also allows capturing arbitrarily complex calls
which can be recomputed later using the
update,TreatmentResponseExperiment-method.
A potential disadvantage is increased RAM usage per
thread due to storing intermediate values in variables, as well as any
memory allocation overhead associated therewith.
See also
data.table::[.data.table, BiocParallel::bplapply