20.2 Parallel processing with foreach

If you have a for loop that is embarrassingly parallel, meaning each iteration is completely independent of the others, you can potentially use foreach to run multiple iterations simultaneously on multiple cores of your computer. Here is an embarrassingly simple example.

This takes about 100 seconds.

Code
print(Sys.time())
for (j in 1:100){
  cat(j, '')
  Sys.sleep(1)
}
print(Sys.time())

This takes about 19 seconds with 6 cores (plus some time spent on the initial setup). Note the j = instead of j in.

Code
library(doSNOW)
library(foreach)
library(parallel)


## Setup
n.cores = detectCores()-2 ## leave 2 cores unused so I can keep using computer

print(Sys.time())
cl = makeCluster(n.cores)
registerDoSNOW(cl)
print(Sys.time())
foreach(j = 1:100, .verbose=F) %dopar% {
  library(dplyr) ## you might need to load libraries that you use in the loop
  cat(j,'')
  Sys.sleep(1)
}
stopCluster(cl)
print(Sys.time())
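Instead of calling library() inside the loop body, foreach also accepts a .packages argument that loads the named packages on each worker before the iterations run. A minimal sketch, assuming the same cluster setup as above:

```r
library(doSNOW)
library(foreach)
library(parallel)

cl = makeCluster(detectCores() - 2)
registerDoSNOW(cl)

## .packages loads dplyr on every worker, so no library() call is
## needed inside the loop body
foreach(j = 1:100, .packages = c('dplyr')) %dopar% {
  Sys.sleep(1)
}

stopCluster(cl)
```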

Depending on how you are using foreach, you may want to save .rds files on each run through the loop, or you may want to combine the results of the iterations at the end. This link shows how the argument .combine can be used to do the latter: https://cran.r-project.org/web/packages/foreach/vignettes/foreach.html

Here is an example of using rbind to combine data frames from each iteration: https://stackoverflow.com/questions/14815810/append-rows-to-dataframe-using-foreach-package. Here is the minimal reproducible example from that page.

Code
resultdf = foreach(i = 1:10, .combine = rbind) %dopar% { 
  data.frame(x = runif(4), 
             i = i)
  }

The final object from each foreach iteration is passed to rbind, and the results are combined into one data frame. In this case, the final object in each iteration is a data.frame with 4 rows and 2 columns, so resultdf ends up with 40 rows and 2 columns.
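If you omit .combine, foreach returns a list with one element per iteration; for scalar results, .combine = c gives a plain vector instead. A small sketch (using %do%, which runs sequentially and needs no cluster):

```r
library(foreach)

## Default: results come back as a list, one element per iteration
resultlist = foreach(i = 1:3) %do% {
  i^2
}
## resultlist[[2]] is 4

## For scalar results, .combine = c flattens them into a vector
resultvec = foreach(i = 1:3, .combine = c) %do% {
  i^2
}
## resultvec is c(1, 4, 9)
```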

If you have a process that takes a long time, you'll definitely want to save your results somewhere, even if you are using this rbind functionality. If, for example, you are using foreach inside another for loop, you can save the result of each foreach call like this.

Code
## Non-parallel for loop
for (j in 1:10){
  
  ## Parallel for loop
  resultdf = foreach(i = 1:10, .combine = rbind) %dopar% { 
    
    ## Do some stuff that creates a data frame 
    data.frame(x = runif(4), 
               i = i)
  }
  
  ## create file name for this iteration, and save
  myfilename = paste0('filename-', j, '.rds') 
  saveRDS(resultdf, file=myfilename)      
}

Or, if each foreach iteration takes a long time, you can save the result inside each iteration instead.

Code
## Non-parallel for loop
for (j in 1:10){
  
  ## Parallel for loop
  foreach(i = 1:10) %dopar% { ## no .combine needed; results are saved to disk
    
    ## Do some stuff that creates a data frame or another object
    df = data.frame(x = runif(4), 
                    i = i)
    
    ## create file name for this iteration, and save
    myfilename = paste0('filename-', j, '-', i, '.rds') 
    saveRDS(df, file=myfilename)      
  }

}

Your approach will depend on how long each iteration takes to run, how big the files are, and so on. However you choose to do it, you will end up with a collection of .rds files that can be read back into R and joined using rbind or some other method.
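Reading the saved files back in takes only a couple of lines. A minimal sketch, assuming the files were written to the working directory with names like filename-1-1.rds:

```r
## find the saved .rds files in the working directory
myfiles = list.files(pattern = 'filename.*\\.rds$')

## read each file and stack the data frames into one
allresults = do.call(rbind, lapply(myfiles, readRDS))
```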