Thursday, April 11, 2013

Speeding up R

I've been programming in R for a while now. Whenever I've had performance problems, they have almost always been due to data.frame usage. To check what's slowing down your R code, just wrap the offending code in calls to Rprof.
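The original listing didn't survive here, so what follows is only a minimal sketch of that kind of profiled run; the size n, the data.frame d, and the column name "x" are assumptions chosen to match the profile output below, not the exact original code.

  n <- 100000
  d <- data.frame(x = numeric(n))

  Rprof("summ.prof")              # start writing profiling samples to summ.prof
  for (i in 1:n) {
    # row-by-row indexing and assignment on a data.frame is the slow part
    d[i, "x"] <- cos(d[i, "x"])
  }
  Rprof(NULL)                     # stop profiling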


The code was written to be particularly slow because of its data.frame usage. You can view the profiling results by running R CMD Rprof summ.prof at the command line:

Each sample represents 0.02 seconds.
Total run time: 2.98 seconds.

Total seconds: time spent in function and callees.
Self seconds: time spent in function alone.

   %       total       %        self
 total    seconds     self    seconds    name
  81.2      2.42       1.3      0.04     "[<-"
  79.9      2.38      65.8      1.96     "[<-.data.frame"
  18.1      0.54      10.1      0.30     "[.data.frame"
  18.1      0.54       0.0      0.00     "["
  12.1      0.36       1.3      0.04     "%in%"
  11.4      0.34       9.4      0.28     "match"
   4.7      0.14       4.0      0.12     "anyDuplicated"
   2.0      0.06       2.0      0.06     "names"
   2.0      0.06       2.0      0.06     "sys.call"
   1.3      0.04       1.3      0.04     "=="
   0.7      0.02       0.7      0.02     ".row_names_info"
   0.7      0.02       0.7      0.02     "NROW"
   0.7      0.02       0.7      0.02     "anyDuplicated.default"
   0.7      0.02       0.7      0.02     "cos"


   %        self       %      total
  self    seconds    total   seconds    name
  65.8      1.96      79.9      2.38     "[<-.data.frame"
  10.1      0.30      18.1      0.54     "[.data.frame"
   9.4      0.28      11.4      0.34     "match"
   4.0      0.12       4.7      0.14     "anyDuplicated"
   2.0      0.06       2.0      0.06     "names"
   2.0      0.06       2.0      0.06     "sys.call"
   1.3      0.04      81.2      2.42     "[<-"
   1.3      0.04      12.1      0.36     "%in%"
   1.3      0.04       1.3      0.04     "=="
   0.7      0.02       0.7      0.02     ".row_names_info"
   0.7      0.02       0.7      0.02     "NROW"
   0.7      0.02       0.7      0.02     "anyDuplicated.default"
   0.7      0.02       0.7      0.02     "cos"

As you can see, it's very slow. After profiling your own code, if you find that the top calls are [.data.frame or [<-.data.frame, then you have a data.frame problem. Here's how I solve it, in the order I try things:

  1. Avoid loops and use vectorized code instead (no for loops, no apply, no sapply). In the example, use d[, 1] <- to assign an entire column at once.
  2. Use numeric indices when possible. In our example, that means using d[i, 1] instead of d[i, "x"].
  3. Get rid of the data.frame for heavy calculations by converting it with data.matrix. In the example above, just use d.matrix <- data.matrix(d) and index into the matrix instead (see the sketch after this list).
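
A rough illustration of how much each suggestion helps; the size n is an assumption for demonstration, and the timings you'll see aren't measurements from this post.

  n <- 10000
  d <- data.frame(x = rnorm(n))

  system.time(for (i in 1:n) d[i, "x"] <- cos(d[i, "x"]))  # loop + character index: slowest
  system.time(for (i in 1:n) d[i, 1] <- cos(d[i, 1]))      # loop + numeric index: somewhat faster
  m <- data.matrix(d)                                      # drop down to a plain numeric matrix
  system.time(for (i in 1:n) m[i, 1] <- cos(m[i, 1]))      # loop on the matrix: much faster
  system.time(d[, 1] <- cos(d[, 1]))                       # fully vectorized: fastest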


Rewriting the silly example above:

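The rewritten listing is also missing here, so this is just a minimal sketch in the spirit of the three suggestions above; d and the reuse of the summ.prof filename are carried over from the earlier sketch, not taken from the original post.

  m <- data.matrix(d)           # work on a plain numeric matrix instead of the data.frame

  Rprof("summ.prof")
  m[, 1] <- cos(m[, 1])         # one vectorized assignment replaces the whole loop
  Rprof(NULL)

  d[, 1] <- m[, 1]              # copy the result back into the data.frame if needed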

This time it runs too quickly for the profiler to record any samples. The execution time is 0.192 seconds, whereas the first version took 3.48 seconds. A pretty good speed-up.
