21.2 Alternatives to rbind for big data frames

When you grow a very large data frame by adding rows with rbind inside a for loop, the process starts out fast but tends to slow down. This is because rbind creates a new data frame each time it is called, copying all of the existing rows, and as the data frame gets bigger it becomes more difficult to find contiguous blocks of memory to store it. Here is an example that starts out taking 1-2 seconds per iteration at j = 1 but ends up taking 3-4 seconds at j = 10. If you run into memory issues with this example, make N or J smaller (but not too small, or you won’t notice the slowdown).

Code
start.time1 = Sys.time()
set.seed(1)
N = 100000000
J = 3
df1 = NULL

for(j in 1:J){
  
  cat(j, '')
  print(Sys.time())
  
  temp = data.frame(x = rnorm(N), 
                    j = j)
  df1 = rbind(df1, temp)
  
}

final.time1 = Sys.time()
print(final.time1)
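
To see the slowdown directly, it can help to time each iteration on its own rather than printing timestamps. The following is a minimal sketch at a much smaller scale. The names `N.small`, `J.small`, `df.small`, and `iter.time` are illustrative and do not appear in the example above, and at this size the per-iteration differences may be too small to notice.

```r
## Time each rbind call separately (small-scale sketch)
set.seed(1)
N.small = 1000                  # illustrative size, far smaller than N above
J.small = 5                     # illustrative number of iterations
df.small = NULL
iter.time = numeric(J.small)    # elapsed seconds for each iteration

for(j in 1:J.small){
  
  temp = data.frame(x = rnorm(N.small), 
                    j = j)
  
  ## system.time() returns user, system, and elapsed times;
  ## keep only the elapsed (wall clock) time for this rbind call
  iter.time[j] = system.time({
    df.small = rbind(df.small, temp)
  })["elapsed"]
  
}

print(iter.time)
print(nrow(df.small))
```

Increasing `N.small` and `J.small` toward the values of N and J used above should make any growth in `iter.time` easier to see, at the cost of more memory.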

In cases like these, it is faster to initialize the large data frame up front and insert rows into that data frame like this.

Code
start.time2 = Sys.time()
set.seed(1)
N = 100000000
J = 3
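## Initialize large data frame with typed NAs so the columns already
## match temp (x is numeric, j is integer) and are not coerced on insert
df2 = data.frame(x = rep(NA_real_, N*J),
                 j = rep(NA_integer_, N*J))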

for(j in 1:J){
  
  cat(j, '')
  print(Sys.time())
  
  temp = data.frame(x = rnorm(N), 
                    j = j)
  
  ## Instead of rbind-ing, insert this data frame into the large data frame
  ## when j=1, start.row = 1; when j=2, start.row = N+1
  start.row = 1 + N*(j-1) 
  end.row = N*j
  df2[start.row:end.row,] = temp
  
}

final.time2 = Sys.time()
print(final.time2)

This version takes 1-2 seconds per iteration from j = 1 to j = 10 and does not appear to slow down as j increases. Here are the total times for the two for loops.

Code
print(final.time1 - start.time1)
print(final.time2 - start.time2)

  • for loop #1, with rbind, took 21 seconds
  • for loop #2, without rbind, took 19 seconds

The difference isn’t huge in this case, but it can be very big. This doesn’t mean you should never use rbind in a for loop. With smaller data, you won’t notice the difference, and rbind is easier to code up.
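
Another common pattern, not covered in the text above, is to store each piece in a pre-allocated list and call rbind once at the end, so the full data frame is assembled a single time instead of once per iteration. The following is a small-scale sketch of that idea. The names `N.small`, `J.small`, `pieces`, and `df3` are illustrative, and no timing is claimed for it here.

```r
## Collect pieces in a list, then bind them once (small-scale sketch)
set.seed(1)
N.small = 1000                        # illustrative size
J.small = 3                           # illustrative number of iterations
pieces = vector("list", J.small)      # one empty slot per iteration

for(j in 1:J.small){
  
  ## Store this iteration's data frame in slot j instead of rbind-ing it
  pieces[[j]] = data.frame(x = rnorm(N.small), 
                           j = j)
  
}

## A single rbind call over all of the pieces
df3 = do.call(rbind, pieces)

print(dim(df3))
```

This keeps the loop body almost as simple as the rbind version. Whether it is faster than inserting into a pre-allocated data frame for a given data set is something to measure rather than assume.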