21.2 Alternatives to rbind for big data frames
When working with very large data frames and iteratively adding rows using rbind, say, within a for loop to grow the data frame, the process starts out fast but tends to become slow. This is because rbind creates a new data frame each time it is used, copying every row accumulated so far, and as the data frame gets bigger and bigger, it also becomes more difficult to find contiguous blocks of memory to store it. Here is an example that starts out taking 1-2 seconds per iteration at j = 1 but ends up taking 3-4 seconds at j = 10. If you have memory issues running this example, make N or J smaller (but not too small, because then you won't notice the slowdown).
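A minimal sketch of the kind of loop described above, growing a data frame with rbind inside a for loop. The sizes N and J here are assumed values, far smaller than those in the text so that it runs quickly, and the names df1 and iter.start are illustrative. At this scale the slowdown will not be visible; the point is only to show the pattern.

```r
# Hypothetical sizes, much smaller than in the text, so the sketch runs quickly
set.seed(1)
N = 1000   # rows added per iteration (assumed value)
J = 5      # number of iterations (assumed value)

df1 = data.frame()               # start with an empty data frame
for (j in 1:J) {
  iter.start = Sys.time()
  temp = data.frame(x = rnorm(N), j = j)
  df1 = rbind(df1, temp)         # builds a new data frame from all rows so far plus temp
  cat(j, format(Sys.time() - iter.start), "\n")   # time taken by this iteration
}
dim(df1)
```

Increasing N and J makes the per-iteration time printed by cat grow as df1 gets larger.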
In cases like these, it is faster to initialize the large data frame up front and insert rows into that data frame like this.
start.time2 = Sys.time()
set.seed(1)
N = 100000000
J = 3
## Initialize the large data frame up front with all N*J rows.
## Typed NAs avoid converting the columns on the first insert.
df2 = data.frame(x = rep(NA_real_, N*J),
                 j = rep(NA_integer_, N*J))
for(j in 1:J){
  cat(j, '')
  print(Sys.time())
  temp = data.frame(x = rnorm(N),
                    j = j)
  ## Instead of rbind-ing, insert this data frame into the large data frame
  ## when j=1, start.row = 1; when j=2, start.row = N+1
  start.row = 1 + N*(j-1)
  end.row = N*j
  df2[start.row:end.row,] = temp
}
final.time2 = Sys.time()
print(final.time2)
print(final.time2 - start.time2) ## total elapsed time for the loop
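The row-index arithmetic is the part most likely to go wrong, so it can help to check it on tiny values first. The sketch below uses assumed sizes (N = 4, J = 3) and computes the first and last row of each block for every j at once.

```r
N = 4   # rows per block (small assumed value for illustration)
J = 3   # number of blocks (assumed value)

start.row = 1 + N * (0:(J - 1))   # first row of block j, i.e. 1 + N*(j-1)
end.row   = N * (1:J)             # last row of block j, i.e. N*j

# One row per iteration: the blocks should cover rows 1 to N*J with no gaps or overlaps
data.frame(j = 1:J, start.row, end.row)
```

With these values the blocks are rows 1 to 4, 5 to 8, and 9 to 12, which together cover all N*J = 12 preallocated rows exactly once.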
This one takes 1-2 seconds each loop from j = 1 to j = 10 and does not appear to slow down for higher j. Here are the total times for each for loop.
- for loop #1, with rbind, took 21 seconds
- for loop #2, without rbind, took 19 seconds
The difference isn't huge in this case, but it can become much larger as the data frame and the number of iterations grow.
This doesn't mean you should never use rbind in a for loop. With smaller data, you won't notice the difference, and it is easier to code up.
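Another common base R pattern, not covered in the text above, is to store each piece in a preallocated list and call rbind only once at the end. This is a sketch under assumed small sizes; the names pieces and df3 are illustrative, and no timing comparison with the other two approaches is claimed here.

```r
set.seed(1)
N = 1000   # rows per piece (assumed small value)
J = 5      # number of pieces (assumed value)

pieces = vector("list", J)   # preallocate a list with one slot per iteration
for (j in 1:J) {
  pieces[[j]] = data.frame(x = rnorm(N), j = j)   # store the piece; nothing is bound yet
}

df3 = do.call(rbind, pieces)   # a single rbind call over all the pieces
dim(df3)
```

This keeps the loop body as simple as the rbind version while avoiding a growing data frame inside the loop.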