Monday, July 29, 2013

Bosco & OSG @ XSEDE13

Last week, several members of the Bosco team attended the XSEDE13 conference.  We enjoyed the great San Diego weather and even caught a little of Comic-Con, which was being held next door when I arrived.  Here are some of the highlights.

OSG Summer School Programming Team Wins First Place!

OSG Summer School Programming Team: Zhe Zhang, Travis Boettcher, Matthew Armbruster, Ben Albrecht, Cassandra Schaening
A group of OSG summer school students formed a team for the student programming challenge.  They were given 10 difficult programming challenges to solve and had all day, from 8 a.m. to 4 p.m., to work on them.  On Friday, the results were released, and the OSG team won first place in the programming challenge!

Carrier Dinner

On Tuesday evening, the XSEDE conference hosted a dinner on the USS Midway, which is anchored not far from the conference.  It was a great evening of food, drinks, and tours!
The USS Midway anchored in San Diego

Bosco @ XSEDE Poster Session

Wednesday night was the XSEDE Poster Session.  We had a great reception.  Many people approached us, asking questions about Bosco and how it could help them.

Derek Weitzel and Miha Ahronovitz

Thursday, July 11, 2013

Creating a native Mac Installer for Bosco

It has been our goal for some time to create a native installer for our supported platforms.  This week, we created a prototype for the first of those platforms, the Mac installer.

The PKG file

The first step in creating the installer is the PKG file.  I installed Bosco the usual way, using the multi-platform installer.  Then, I looked through the Bosco files (using a find command) to locate all of the instances of hard-coded paths.  They were located in the bosco_setenv and condor_config files.  I modified those to use the environment variable HOME.
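The search-and-replace step could look something like the sketch below.  It is illustrative only, not the commands actually used for the installer: the helper names list_hard_coded and rewrite_prefix are made up, and it assumes the hard-coded prefix is a plain path with no characters that are special to sed.

```shell
#!/bin/sh
# Illustrative sketch only; these helpers are hypothetical, not part of Bosco.
# Assumes the old prefix and the replacement contain no '|', '&' or '\' and
# that the old prefix has no characters sed treats as regex syntax.

# Print every file under directory $1 that contains the fixed string $2.
list_hard_coded() {
    # -r recurse, -l print file names only, -F match a fixed string (no regex)
    grep -rlF -- "$2" "$1"
}

# In file $1, replace every occurrence of prefix $2 with the text $3.
rewrite_prefix() {
    tmp=$(mktemp) || return 1
    # '|' is the sed delimiter because paths are full of '/'.
    # cat back into the original file so its permissions are preserved.
    sed "s|$2|$3|g" "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

A shell file such as bosco_setenv would take the literal text $HOME as the replacement.  The condor_config file would need HTCondor's $ENV(HOME) macro instead, since HTCondor does not expand shell variables on its own.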

Then I used PackageMaker, which is part of the auxiliary tools for Mac package available on the Mac developer page (free registration required).  In PackageMaker, I opened the Bosco directory in the home directory.

I unchecked 'Require admin authentication' because Bosco will be installed in the user's home directory.  After these simple steps, I built the installer and tested it.  Everything worked!

DMG File

A DMG is a disk image that is used for software distribution.  The first step is to create the Bosco.dmg disk image using Mac's built-in Disk Utility.

After mounting the DMG, I copied the Bosco installer into the image.  Then, I ran the command:
bless --folder /Volumes/Bosco --openfolder /Volumes/Bosco

This causes a window to open automatically when the DMG is mounted.

Where to get this

The Bosco DMG makes it very easy to install Bosco on a Mac.  The DMG is available on Dropbox, and will be put in production when a few more minor fixes are put in place.

Friday, July 5, 2013

The next step for Bosco: BoscoR

With the 1.1 release of Bosco in January, the Bosco team finally had a usable product for researchers.  It didn't have the best usability, but it was a solid tool to build on.  When planning the next release, 1.2, we determined that we would not include many new features and instead focus on usability.  With that focus, we interviewed researchers and looked around at our colleagues to see how most researchers interacted with their data on a daily basis, which was not with Condor or Bosco.  Researchers used applications such as Matlab, Galaxy, or R.  After investigating the options, we determined that R would be a great first application to integrate with Bosco.  It is open source, has a strong community, and most importantly, it is heavily used by the researchers around us.

The Initial Steps

Our first step was to look at how people are submitting R jobs to HTC resources now.  The CHTC does a great job of submitting entire R scripts for Wisconsin researchers.  There is also SubmitR from HubZero, which integrates a GUI with submission of R scripts.  Both of these methods had good and bad attributes.

The CHTC method made it very easy to submit complex R scripts with many package dependencies to a very large number of nodes.  But the researcher could submit only from UW machines with the proper scripts installed.  Also, the researcher still had to learn at least a minimal set of HTCondor commands to successfully run their jobs.  The researcher had to leave their environment and learn another: HTCondor.

SubmitR, on the other hand, was very easy to use, with a graphical interface to upload and start jobs.  SubmitR is open source, but deeply, deeply integrated with HubZero.  Therefore, any submissions to SubmitR would need to be centralized to a small number of locations.  Also, SubmitR had the same issue: the researcher had to leave their environment to learn another, in this case HubZero and SubmitR.

We found that with Bosco installed on the researcher's laptop, we could leverage that locality to create a better interface.  Our next step was to look at possible packages that could submit R jobs to clusters.  They are listed on the HPC page for R packages.  The only package that fit our requirements (open source, released, and in CRAN) was GridR.

Integrating Bosco with GridR

The source of GridR can be very intimidating at first.  It uses asynchronous assignments to variables (which I would later learn is frowned upon in the R community).  It submits jobs by forking new processes to do the submission and monitor the jobs.  It uses R on the remote side to load the functions and variables to run the processing.

Even though GridR is complicated, it has one great feature: everything is done from inside the R environment.  The researcher simply loads a package and uses a familiar function, apply.  With that function, the researcher can send an R function to a remote cluster to be processed, and the result will be written asynchronously back to the environment.

When I began looking at modifying GridR, my first step was to find the source repository.  Unfortunately, the package hasn't received an update since 2009, and the source repo was nowhere to be found.  Luckily, though, another researcher from Harvard had just modified GridR to submit R jobs to a locally installed HTCondor cluster.  By studying his changes, I was able to fork his work and add Bosco submission to GridR.

The modifications to GridR were minimal.  I needed to add a new submission type to the package, bosco, and modify the local submit scripts with the Bosco submit settings.  But GridR requires R to be installed on the remote cluster, so how is Bosco going to get R to the remote worker node before the job executes?

Several different ideas were floated for installing R on the remote worker node.  We settled on a solution that used a bootstrap method.  When the remote job started, instead of directly executing R, it would run a custom bootstrap script written for Bosco submissions.  This bootstrap script would determine whether R was already installed and, if not, download and install R either in the user's home directory or temporarily on the worker node for the duration of the job.  This bootstrapping enables GridR submissions to run successfully anywhere with network access, whether on a campus cluster or on national infrastructure such as the OSG or XSEDE.
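To make the bootstrap idea concrete, here is a minimal sketch of the check-then-install logic.  It is not the actual BoscoR bootstrap script: the function name locate_or_install_r, the R_TARBALL_URL variable, the use of curl, and the assumption that the tarball unpacks to bin/Rscript under the chosen directory are all placeholders for illustration.

```shell
#!/bin/sh
# Illustrative bootstrap sketch; not the actual BoscoR script.
# R_TARBALL_URL is a placeholder the caller must supply: this sketch does not
# know where the real bootstrap downloads R from. It also assumes the tarball
# unpacks so that bin/Rscript ends up directly under the install prefix.

# Print the path of a usable Rscript, installing R under $1 if none is found.
locate_or_install_r() {
    prefix=$1   # e.g. a directory under $HOME, or scratch space on the node

    # 1. R already on the PATH? Use it.
    if command -v Rscript >/dev/null 2>&1; then
        command -v Rscript
        return 0
    fi

    # 2. Installed under the prefix by an earlier job? Reuse it.
    if [ -x "$prefix/bin/Rscript" ]; then
        echo "$prefix/bin/Rscript"
        return 0
    fi

    # 3. Otherwise download a prebuilt tarball and unpack it under the prefix.
    if [ -z "${R_TARBALL_URL:-}" ]; then
        echo "R_TARBALL_URL is not set" >&2
        return 1
    fi
    mkdir -p "$prefix" || return 1
    curl -fsSL "$R_TARBALL_URL" | tar -xzf - -C "$prefix" || return 1
    [ -x "$prefix/bin/Rscript" ] || return 1
    echo "$prefix/bin/Rscript"
}
```

A job wrapper could then capture the printed path and run the R payload with it.  Passing a directory in the home directory as the prefix lets later jobs reuse the install, while a scratch directory on the worker node keeps it only for the duration of the job.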

Current Status

The first round of alpha testers has reported great success at running BoscoR.  From installation to submission, all alpha testers were successful at running R jobs on remote clusters without any help from the Bosco developers.  We also got much-needed feedback from the R users on how the interface worked for them.

The most common feedback concerned the use of the apply function for running multiple jobs.  The testers are used to the apply function built into the R language, which runs a function once for every element in a list, then returns a list of results.  GridR can do this, but it is not the default action of GridR's apply, so the interface is inconsistent with the rest of the R language.  The Bosco team is working hard on this, and it will be fixed in the next BoscoR beta release, out next week.

The Bosco team has also submitted the updates to GridR for review by CRAN.  Since GridR has not received an update since 2009, many policies have changed as to how packages can interact with the user's environment.  As mentioned earlier, asynchronously modifying the user's environment, such as setting a variable to the result of a remotely executed function, is against CRAN policy.  I am confident that this can be worked out, but it will take effort to include the updated GridR in CRAN, and it may take a while.

Closing Thoughts

Integrating with scientific applications and frameworks is the next step in making Bosco more usable.  Bosco cannot remain just a tool to be used; it must have interfaces built on top of it to connect with researchers, to reach out to them.  The researchers do not want to leave their environment, and BoscoR provides them a way to stay in their comfort zone, all while completing their research at new levels.

Getting BoscoR

You can get BoscoR from the Bosco website.