Thursday, April 11, 2013

Speeding up R

I've been programming in R for a while now. Whenever I've had performance problems, they have almost always been due to data.frame usage. To find out what's slowing down your R code, wrap it in calls to the Rprof function like so:

Rprof("summ.prof")
d <- data.frame(x=rnorm(10000), y=rnorm(10000))
for(i in 1:nrow(d)) {
  d[i, "x"] <- cos(d[i, "y"]) - d[i, "x"]
}
Rprof(NULL)

I wrote this code to be particularly slow because of its data.frame usage. You can view the profiling results by running R CMD Rprof summ.prof from the shell:
Each sample represents 0.02 seconds.
Total run time: 2.98 seconds.

Total seconds: time spent in function and callees.
Self seconds: time spent in function alone.

   %       total       %        self
 total    seconds     self    seconds    name
  81.2      2.42       1.3      0.04     "[<-"
  79.9      2.38      65.8      1.96     "[<-.data.frame"
  18.1      0.54      10.1      0.30     "[.data.frame"
  18.1      0.54       0.0      0.00     "["
  12.1      0.36       1.3      0.04     "%in%"
  11.4      0.34       9.4      0.28     "match"
   4.7      0.14       4.0      0.12     "anyDuplicated"
   2.0      0.06       2.0      0.06     "names"
   2.0      0.06       2.0      0.06     "sys.call"
   1.3      0.04       1.3      0.04     "=="
   0.7      0.02       0.7      0.02     ".row_names_info"
   0.7      0.02       0.7      0.02     "NROW"
   0.7      0.02       0.7      0.02     "anyDuplicated.default"
   0.7      0.02       0.7      0.02     "cos"


   %        self       %      total
  self    seconds    total   seconds    name
  65.8      1.96      79.9      2.38     "[<-.data.frame"
  10.1      0.30      18.1      0.54     "[.data.frame"
   9.4      0.28      11.4      0.34     "match"
   4.0      0.12       4.7      0.14     "anyDuplicated"
   2.0      0.06       2.0      0.06     "names"
   2.0      0.06       2.0      0.06     "sys.call"
   1.3      0.04      81.2      2.42     "[<-"
   1.3      0.04      12.1      0.36     "%in%"
   1.3      0.04       1.3      0.04     "=="
   0.7      0.02       0.7      0.02     ".row_names_info"
   0.7      0.02       0.7      0.02     "NROW"
   0.7      0.02       0.7      0.02     "anyDuplicated.default"
   0.7      0.02       0.7      0.02     "cos"
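The same two tables can also be produced without leaving R: the base-R summaryRprof() function parses the profile file and returns the by-total and by-self breakdowns as data.frames. A self-contained sketch, re-running the slow loop so the profile file exists:

```r
# Re-run the slow loop under the profiler, then summarize from within R.
Rprof("summ.prof")
d <- data.frame(x = rnorm(10000), y = rnorm(10000))
for (i in 1:nrow(d)) {
  d[i, "x"] <- cos(d[i, "y"]) - d[i, "x"]
}
Rprof(NULL)

s <- summaryRprof("summ.prof")  # base R; no packages needed
head(s$by.total)                # time spent in each function and its callees
head(s$by.self)                 # time spent in each function alone
s$sampling.interval             # sampling interval in seconds (0.02 by default)
```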
As you can see, it's very slow. After profiling your own code, if you find that the top calls are [.data.frame or [<-.data.frame, then you have a data.frame problem. Here's how I solve it, in the order I try things:

  1. Avoid loops; use vectorized code (no for loops, no apply, no sapply). In the example, assign the entire column at once with d[, 1] <-.
  2. Use numeric indices when possible. In our example, that means d[i, 1] instead of d[i, "x"].
  3. Get rid of the data.frame for heavy calculations by converting it with the data.matrix function. In the example above, use d.matrix <- data.matrix(d).
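As a quick sketch of points 1 and 2, using the x/y columns from the example above: the whole loop collapses into one vectorized assignment, and any remaining single-cell access is cheaper with numeric indices.

```r
d <- data.frame(x = rnorm(10000), y = rnorm(10000))

# Point 1: one vectorized assignment replaces the entire for loop.
d[, 1] <- cos(d[, 2]) - d[, 1]

# Point 2: if you do need a single cell, numeric indices skip the
# column-name matching that showed up as match() / %in% in the profile.
d[1, 1] <- cos(d[1, 2]) - d[1, 1]
```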


Rewriting the silly example above,

Rprof("summ.out")
d <- data.frame(x=rnorm(10000), y=rnorm(10000))
d.matrix <- data.matrix(d)
d.matrix[,1] <- cos(d.matrix[,2]) - d.matrix[,1]
d <- data.frame(d.matrix)
Rprof(NULL)

This time, it runs too quickly for the profiler to record any samples. The execution time is 0.192 seconds, versus 3.48 seconds for the first version. A pretty good speed-up.
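You can reproduce the comparison yourself with system.time(). The slow()/fast() wrappers below are just hypothetical names packaging the two versions from this post; exact timings will vary by machine.

```r
slow <- function(n = 10000) {   # original row-by-row loop version
  d <- data.frame(x = rnorm(n), y = rnorm(n))
  for (i in 1:nrow(d)) d[i, "x"] <- cos(d[i, "y"]) - d[i, "x"]
  d
}

fast <- function(n = 10000) {   # data.matrix version
  d <- data.frame(x = rnorm(n), y = rnorm(n))
  m <- data.matrix(d)
  m[, 1] <- cos(m[, 2]) - m[, 1]
  data.frame(m)
}

system.time(slow())   # on the order of seconds
system.time(fast())   # a small fraction of that
```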
