20.2 Parallel processing with foreach
If you have a for loop that is embarrassingly parallel, where each iteration is completely independent of the others, you can potentially use foreach to run multiple iterations simultaneously on multiple cores of your computer. Here is an embarrassingly simple example.
Run as an ordinary serial for loop, this takes about 100 seconds.
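A minimal sketch of such a serial loop, assuming 100 one-second iterations to match the parallel version below:
Code
## Serial version: 100 iterations of a 1-second task, so roughly 100 seconds total
print(Sys.time())
for (j in 1:100){
  cat(j, '')
  Sys.sleep(1)
}
print(Sys.time())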
The parallel version below takes about 19 seconds with 6 cores (plus some time spent on the initial setup). Note the use of j = in foreach() rather than the j in of a regular for loop.
Code
library(doSNOW)
library(foreach)
library(parallel)
## Setup
n.cores = detectCores() - 2 ## leave 2 cores unused so I can keep using the computer
print(Sys.time())
cl = makeCluster(n.cores)
registerDoSNOW(cl)
print(Sys.time())
foreach(j = 1:100, .verbose = F) %dopar% {
  library(dplyr) ## you might need to load libraries that you use in the loop
  cat(j, '')
  Sys.sleep(1)
}
stopCluster(cl)
print(Sys.time())
Depending on where you are using foreach, you may want to save .rds files on each run through the loop, or you may want to combine the results of each iteration at the end. The foreach vignette shows how the .combine argument can be used to do the latter: https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html
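For instance, here is a small sketch of using .combine, assuming a cluster has been registered as in the example above (without .combine, foreach returns a list):
Code
## Collect one value per iteration into a vector with .combine = c
xs = foreach(i = 1:5, .combine = c) %dopar% {
  sqrt(i)
}
xs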
Here is an example of using rbind to combine data frames from each iteration: https://stackoverflow.com/questions/14815810/append-rows-to-dataframe-using-foreach-package. A minimal example in that style is sketched below.
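This is a sketch rather than the exact code from the linked page, and it again assumes a registered parallel backend:
Code
## Each iteration returns a small data frame; .combine = rbind stacks them
resultdf = foreach(i = 1:10, .combine = rbind) %dopar% {
  data.frame(x = runif(4),
             i = i)
}
head(resultdf)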
The final objects from each foreach iteration get rbind-ed together. In this case, the final object in each iteration is a data.frame with 4 rows and 2 columns.
If you have a process that takes a long time, you’ll definitely want to save your results somewhere, even if you are using this rbind functionality. If, for example, you are using foreach inside another for loop, then you can save the result of foreach like this.
Code
## Non-parallel outer for loop
for (j in 1:10){
  ## Parallel inner for loop
  resultdf = foreach(i = 1:10, .combine = rbind) %dopar% {
    ## Do some stuff that creates a data frame
    data.frame(x = runif(4),
               i = i)
  }
  ## Create a file name for this iteration, and save the combined result
  myfilename = paste0('filename', j, '.rds')
  saveRDS(resultdf, file = myfilename)
}
Or, if each foreach iteration takes a long time, you can instead save the result inside each iteration.
Code
## Non-parallel outer for loop
for (j in 1:10){
  ## Parallel inner for loop
  foreach(i = 1:10, .combine = rbind) %dopar% {
    ## Do some stuff that creates a data frame or another object
    df = data.frame(x = runif(4),
                    i = i)
    ## Create a file name for this iteration, and save
    myfilename = paste0('filename-', j, '-', i, '.rds')
    saveRDS(df, file = myfilename)
  }
}
Your approach will depend on how long each iteration takes to run, how big the files are, etc. However you choose to do it, you will end up with a collection of .rds files that can be read back into R and joined using rbind or some other method.
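For example, here is a sketch of reading the saved files back in and stacking them, assuming they were written with the paste0() naming pattern used above:
Code
## List the saved .rds files and rbind them into one data frame
myfiles = list.files(pattern = '^filename.*\\.rds$')
alldf = do.call(rbind, lapply(myfiles, readRDS))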