To check what's slowing down your R code, just use the Rprof command. I wrote the example code to be particularly slow due to its data.frame usage. You may view the results of the profiling by running R CMD Rprof summ.prof:
Each sample represents 0.02 seconds.
Total run time: 2.98 seconds.

Total seconds: time spent in function and callees.
Self seconds: time spent in function alone.

% total   total seconds   % self   self seconds   name
   81.2            2.42      1.3           0.04   "[<-"
   79.9            2.38     65.8           1.96   "[<-.data.frame"
   18.1            0.54     10.1           0.30   "[.data.frame"
   18.1            0.54      0.0           0.00   "["
   12.1            0.36      1.3           0.04   "%in%"
   11.4            0.34      9.4           0.28   "match"
    4.7            0.14      4.0           0.12   "anyDuplicated"
    2.0            0.06      2.0           0.06   "names"
    2.0            0.06      2.0           0.06   "sys.call"
    1.3            0.04      1.3           0.04   "=="
    0.7            0.02      0.7           0.02   ".row_names_info"
    0.7            0.02      0.7           0.02   "NROW"
    0.7            0.02      0.7           0.02   "anyDuplicated.default"
    0.7            0.02      0.7           0.02   "cos"

% self   self seconds   % total   total seconds   name
  65.8           1.96      79.9            2.38   "[<-.data.frame"
  10.1           0.30      18.1            0.54   "[.data.frame"
   9.4           0.28      11.4            0.34   "match"
   4.0           0.12       4.7            0.14   "anyDuplicated"
   2.0           0.06       2.0            0.06   "names"
   2.0           0.06       2.0            0.06   "sys.call"
   1.3           0.04      81.2            2.42   "[<-"
   1.3           0.04      12.1            0.36   "%in%"
   1.3           0.04       1.3            0.04   "=="
   0.7           0.02       0.7            0.02   ".row_names_info"
   0.7           0.02       0.7            0.02   "NROW"
   0.7           0.02       0.7            0.02   "anyDuplicated.default"
   0.7           0.02       0.7            0.02   "cos"

As you can see, it's very slow. After profiling your own code, if you find that the top calls are
[.data.frame or [<-.data.frame, then you have a data.frame problem. Here's how I solve this, in order of things I try:

- Avoid loops and use vectorized code (no for loops, no apply, no sapply). In the example, use d[,1] <- for assigning an entire column.
- Use numeric indices when possible. In our example, that means using d[i, 1] instead of d[i, "x"].
- Get rid of the data.frame for heavy calculations by using the data.matrix command. In the example above, just use d.matrix <- data.matrix(d).
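For illustration, here is a minimal sketch of the kind of data.frame-heavy loop that produces a profile like the one above, together with the profiling calls around it. This is not the exact listing behind those numbers: the data frame d, its columns x and y, the size n, and the temporary output file are all assumptions, and summaryRprof() is used as an in-session alternative to R CMD Rprof.

```r
# Hypothetical slow example: element-by-element updates on a data.frame.
n <- 10000                                  # assumed size, big enough to collect samples
d <- data.frame(x = seq_len(n) / n, y = numeric(n))

prof_file <- tempfile(fileext = ".prof")    # assumed output file
Rprof(prof_file)                            # start the sampling profiler
for (i in seq_len(n)) {
  # Every iteration goes through `[.data.frame` and `[<-.data.frame`
  d[i, "y"] <- cos(d[i, "x"])
}
Rprof(NULL)                                 # stop profiling

# summaryRprof() stops with an error when no samples were recorded,
# so the call is guarded here.
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))
```

If the by.self table is dominated by [<-.data.frame and [.data.frame, the tips above are the ones to try first.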
After rewriting the silly example above along these lines, it runs too quickly for any events to be recorded with the profiling. The execution time is 0.192 seconds, whereas the first version took 3.48 seconds. A pretty good speed-up.
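A rewrite in that spirit might look like the sketch below. As before, the names d, x, y, n and d.fast are assumptions made for illustration, so treat this as one possible shape rather than the exact code that was timed.

```r
# Hypothetical data, same layout as a two-column numeric data.frame.
n <- 10000
d <- data.frame(x = seq_len(n) / n, y = numeric(n))

# Tips 1 and 2: assign the whole column at once, using numeric indices.
d[, 2] <- cos(d[, 1])

# Tip 3: if a loop really is unavoidable, run it on a plain numeric matrix
# so each step avoids the data.frame indexing methods.
d.matrix <- data.matrix(d)
for (i in seq_len(n)) {
  d.matrix[i, 2] <- cos(d.matrix[i, 1])
}
d.fast <- as.data.frame(d.matrix)   # convert back only if a data.frame is needed
```

Wrapping each variant in system.time() gives a rough comparison; the actual numbers depend on the machine and the R version.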