The Initial Steps
Our first step was to look at how people submit R jobs to HTC resources today. The CHTC does a great job of submitting entire R scripts for Wisconsin researchers. There is also SubmitR from HubZero, which integrates a GUI with submission of R scripts. Both of these methods had good and bad attributes.
The CHTC method made it very easy to submit complex R scripts with many package dependencies to a very large number of nodes. But the researcher could submit only from UW machines with the proper scripts installed. The researcher also still had to learn at least a minimal set of HTCondor commands to run their jobs successfully. In short, the researcher had to leave their environment and learn another one, HTCondor.
SubmitR, on the other hand, was very easy to use, with a graphical interface to upload and start jobs. SubmitR is open source, but it is deeply integrated with HubZero. Therefore, any submissions through SubmitR would need to be centralized at a small number of locations. SubmitR also had the same issue as the CHTC method: the researcher had to leave their environment to learn another, in this case HubZero and SubmitR.
We found that with Bosco installed on the researcher's laptop, we could leverage that locality to create a better interface. Our next step was to look at possible packages that could submit R jobs to clusters. They are listed on the HPC page for R packages. The only package that fit our requirements (open source, released, and in CRAN) was GridR.
Integrating Bosco with GridR
At first glance, the source of GridR can be very intimidating. It uses asynchronous assignments to variables (which I would later learn is frowned upon in the R community). It submits jobs by forking a new process to do the submission and monitor the jobs. It uses R on the remote side to load the functions and variables needed to run the processing.
Even though GridR is complicated, it has one great feature: everything is done from inside the R environment. The researcher simply loads a package and uses a familiar function, apply. With that function, the researcher can send an R function to a remote cluster to be processed, and the result will be written asynchronously back to the environment.
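To make the idea of a result being "written asynchronously back to the environment" concrete, here is a minimal sketch in Python rather than R. It is not GridR's code: the `submit_async` helper, the `env` dictionary standing in for the user's environment, and the local thread pool standing in for a remote cluster are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def submit_async(executor, env, name, fn, *args):
    """Run fn(*args) in the background and bind the result to env[name] when done.

    Hypothetical helper: env plays the role of the user's environment,
    and the executor plays the role of the remote cluster.
    """
    future = executor.submit(fn, *args)
    # The callback fires once the job finishes and writes the result back.
    future.add_done_callback(lambda done: env.__setitem__(name, done.result()))
    return future


def square(x):
    return x * x


env = {}  # stand-in for the user's environment
with ThreadPoolExecutor(max_workers=1) as executor:
    submit_async(executor, env, "y", square, 7)
# Leaving the with-block waits for the worker, so the callback has run by now.
print(env["y"])
```

The caller never collects a return value explicitly; the variable simply appears in the environment once the job completes, which is the convenience (and the surprise) of this style.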
When I began looking at modifying GridR, my first step was to find the source repository. Unfortunately, the package hasn't received an update since 2009, and the source repo was nowhere to be found. Luckily, another researcher from Harvard had just modified GridR to submit R jobs to a locally installed HTCondor cluster. By studying his changes, I was able to fork his work and add Bosco submission to GridR.
The modifications to GridR were minimal. I needed to add a new submission type to the package, bosco, and modify the local submit scripts with the Bosco submit settings. But GridR requires R to be installed on the remote cluster, so how would Bosco get R to the remote worker node before the job executes?
Several different ideas were floated for installing R on the remote worker node. We settled on a solution that uses a bootstrap method. When the remote job starts, instead of directly executing R, it runs a custom bootstrap script written for Bosco submissions. This bootstrap script determines whether R is already installed, and if not, it downloads and installs R either in the user's home directory or temporarily on the worker node for the duration of the job. This bootstrapping enables GridR submissions to run successfully anywhere with network access, whether on a campus cluster or on national infrastructure such as the OSG or XSEDE.
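The decision the bootstrap script makes can be sketched as follows. This is only an illustration in Python, not the actual Bosco bootstrap script: the function name `plan_bootstrap`, the `home_writable` flag, and the step descriptions are all hypothetical, and the sketch returns a plan instead of downloading anything.

```python
import shutil


def plan_bootstrap(which=shutil.which, home_writable=True):
    """Return the steps a bootstrap script might take before running the job.

    Hypothetical sketch: `which` looks up an executable on the PATH and
    `home_writable` says whether R could be kept in the user's home directory.
    """
    rscript = which("Rscript")
    if rscript is not None:
        # R is already installed on the worker node, so use it directly.
        return ["run job with " + rscript]
    # Otherwise install R for this user, or only for the duration of the job.
    target = "home directory" if home_writable else "job scratch directory"
    return ["download R", "install R into " + target, "run job with installed R"]


print(plan_bootstrap())
```

Passing the lookup function as a parameter keeps the decision logic separate from the machine it runs on, which makes both branches easy to exercise.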
Current Status
The first round of alpha testers has reported great success running BoscoR. From installation to submission, all alpha testers were able to run R jobs on remote clusters without any help from the Bosco developers. We also got much-needed feedback from the R users on how the interface worked for them.
The most common feedback concerned the use of the apply function for running multiple jobs. The testers are used to the apply function built into the R language, which runs a function once for every element in a list, then returns a list of results. GridR can do this, but it is not the default behavior of GridR's apply, so the interface is inconsistent with the rest of the R language. The Bosco team is working hard on this, and it will be fixed in the next BoscoR beta release, out next week.
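The difference the testers noticed can be illustrated with a small Python sketch. It is not GridR's implementation: `run_job` is a hypothetical stand-in for submitting one remote job, and `apply_once` is only an assumption about what a non-element-wise default might look like, contrasted with the per-element behavior the testers expected.

```python
def run_job(fn, arg):
    """Hypothetical stand-in for submitting one remote job and waiting for it."""
    return fn(arg)


def apply_once(fn, items):
    """One job receives the whole list (assumed non-element-wise behavior)."""
    return run_job(fn, items)


def apply_each(fn, items):
    """One job per element, returning a list of results (what testers expected)."""
    return [run_job(fn, item) for item in items]


values = [1, 2, 3]
print(apply_once(len, values))               # a single result for the whole list
print(apply_each(lambda x: x * 10, values))  # one result per element
```

With the per-element form, a list of N inputs naturally becomes N independent jobs, which is the shape of work that high-throughput resources handle well.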
The Bosco team has also submitted the updates to GridR for review by CRAN. Since GridR has not received an update since 2009, many policies have changed regarding how packages can interact with the user's environment. As mentioned earlier, asynchronously modifying the user's environment, such as setting a variable to the result of a remotely executed function, is against CRAN policy. I am confident that this can be worked out, but it will take effort to get the updated GridR into CRAN, and it may take a while.
Closing Thoughts
Integrating with scientific applications and frameworks is the next step for Bosco to increase usability. Bosco cannot remain just a tool to be used; it must have interfaces built on top of it that connect with researchers and reach out to them. Researchers do not want to leave their environment, and BoscoR gives them a way to stay in their comfort zone while taking their research to new levels.
Getting BoscoR
You can get BoscoR from the Bosco website.