Thursday, November 29, 2012

BOSCO v1.1 Features: Multi-Cluster Support

This is the fourth in the series of posts on features in the 1.1 release of BOSCO.  The previous posts have focused on SSH File Transfer, Single Port Usage, and Multi-OS Support, all of which are new in the 1.1 release.  But now I want to talk about a feature that was in the previous 1.0 release, and is important enough to discuss again: Multi-Cluster Support.  This feature is very technically challenging, so I will start with why you should care about Multi-Cluster in BOSCO.

Why do you care about Multi-Cluster?

On a typical campus, each department may have its own cluster for its own use.  Physics may have a cluster, Computer Science has one, and Chemistry may have another.  Or a computing center may have multiple clusters reflecting multiple generations of hardware.  In either case, users have to pick which cluster to submit jobs to, rather than submitting to whichever has the most free cores.

You don't care which cluster you run on.  You don't want to learn how to submit jobs to the PBS Chemistry cluster or the SGE Computer Science cluster, and who wants to learn two different submission methods anyway?  You only care about finishing your research.

BOSCO can unify the clusters by overlaying each cluster with an on-demand Condor cluster.  That way, you only learn the Condor submission method.  The Condor job you submit to BOSCO will then run at whichever cluster first has free cores for you.

What Is Multi-Cluster?

In BOSCO, Multi-Cluster is the feature that allows submission to multiple clusters with a single submit file.  A user submits a regular Condor submit file, such as:
universe = vanilla
output = stdout.out
error = stderr.err
Executable = /bin/echo
arguments = hello
log = job.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue 1
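Assuming the submit description above is saved as job.sub (a filename I've picked for illustration), submitting it looks just like any other Condor job; BOSCO handles the glideins behind the scenes:

# Submit the job to BOSCO's local Condor schedd:
condor_submit job.sub

# Check on it; once a glidein claims a remote worker node, the job runs there:
condor_q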

When the job is submitted, BOSCO will submit glideins to each of the clusters that are configured in BOSCO.  The glideins will start on the worker nodes of the remote clusters and join the local pool at the submission host, creating an on-demand Condor pool.  The jobs will then run on the remote worker nodes through the glideins.  This may be best illustrated by a picture I made for my thesis:

Overview of the BOSCO job submission to a PBS cluster

In the diagram, first BOSCO will submit the glidein to the PBS cluster.  Second, PBS will schedule and start the glidein on a worker node.  Finally, the glidein will report to the BOSCO submit host and will start running user jobs.

The Multi-Cluster feature will continue to be an integral part of BOSCO.

The Beta release of BOSCO is due out next week (fingers crossed!).  Watch this blog and the BOSCO website for more news.

Tuesday, November 6, 2012

BOSCO v1.1 Features: Multi-OS Support

This is part 3 of my ongoing series describing new features in BOSCO v1.1.  Part 1 covered file transfer over SSH, Part 2 covered single port usage.  This post will cover the Multi-OS Support in BOSCO.

What is it?

The Multi-OS feature is intended to allow users to submit to clusters that may not run the same operating system as the submit host.  This is especially useful when a user's submit host is their personal computer running Debian, while the supercomputer in the next building is running Red Hat 6 -- a common occurrence.

The Multi-OS components follow a basic process in order to operate:
  1. Detect the remote operating system with the findplatform script.
  2. Download (from the cloud?) the appropriate bosco version for the platform.
  3. Transfer the needed files to the remote cluster.  This includes grabbing the libraries and binaries for the campus factory's glideins.  Glidein creation takes the most time in this process, since the libraries and binaries must be compressed before transfer.
  4. When BOSCO detects jobs idle on the submit host, it will start glideins appropriate for the platform to service the jobs.
The Multi-OS support required modifying the cluster-addition process and adding both a findplatform script and a glidein_creation script.
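The idea behind the detection step can be sketched in a few lines of shell.  This is not the actual findplatform script, just a minimal illustration of the logic: identify the OS and architecture, and use that string to pick a matching pre-built tarball.

```shell
# Hypothetical sketch of platform detection (not the real findplatform script).
# On a remote cluster this would run over SSH against the login node.
os=$(uname -s)        # e.g. Linux
arch=$(uname -m)      # e.g. x86_64
echo "platform: ${os}_${arch}"
# BOSCO would use a string like this to choose which binary tarball to download.
```

The real script also distinguishes distributions (e.g. RHEL 5 vs. RHEL 6), since binaries built against one glibc generation may not run on another.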

Why do I care?

It is becoming increasingly common that what users run on their machines and what is run on supercomputers are different.  When this is true, it is difficult to install software from one onto the other.  The Multi-OS feature will greatly simplify the installation of BOSCO on clusters.

Our goal with the Multi-OS support is that users may not even know it is working.  The user just says "I want to run on this cluster," and BOSCO makes it happen, no matter what operating system is running on the remote cluster.

One of my tests simulated a possible user scenario.  I was running an up-to-date RHEL 6 machine on which I installed BOSCO, and I wanted to submit jobs to a RHEL 5 cluster located in our datacenter.  If I had simply copied the bosco install from the RHEL 6 submit host, none of the binaries would have worked.  Instead, I used bosco_cluster -a to add the RHEL 5 cluster, and jobs ran seamlessly from the RHEL 6 machine to the RHEL 5 cluster.
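That test boils down to two commands.  The hostname below is a placeholder, and the trailing queue-type argument reflects how I recall the bosco_cluster interface, so double-check it against the documentation:

# Add the RHEL 5 cluster; BOSCO detects the remote platform and installs
# the matching binaries automatically:
bosco_cluster -a user@rhel5-cluster.example.edu pbs

# Then submit jobs from the RHEL 6 machine exactly as before:
condor_submit job.sub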

The Multi-OS support is available in the latest alphas available on the bosco download page.

Thursday, November 1, 2012

BOSCO v1.1 Features: Single Port Usage

Welcome to part 2 of my ongoing series of v1.1 features for BOSCO.  Part 1 was on SSH File Transfer.

This time, I'll talk about a new feature that we hadn't planned on implementing at first: using only a single port for all communication.  After a small investigation, we discovered that using a single port is very simple and doesn't interrupt other components.  I talked briefly about it in a previous post.

What is it?

In BOSCO 1.0, the submit host needed many ports open for connections originating from the remote clusters.  This was caused by 2 mechanisms:
  1. File transfer from the BOSCO submit host to the cluster login node before issuing the local submit call (qsub, condor_submit...).  This opens ports on the submit host because the cluster would call out to the submit host to initiate transfers.
  2. Connections for control, status, and workflow management between the cluster worker nodes and BOSCO submit host.  This is the Campus Factory, which gives BOSCO the traditional Condor look and feel.
Between these two mechanisms, the submit host needs a large swath of open ports in order to operate correctly -- and as you scale up, you need even more.

The file transfers between the submit host and the login node are now done over SSH; see my previous post.

With the new single port feature, all control, status, and workflow management connections are routed through HTCondor's shared_port daemon on port 11000 (which is hardcoded, but I picked it at random).
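For the curious, routing everything through the shared port daemon comes down to a few HTCondor configuration knobs.  The snippet below is a sketch of the relevant settings, not BOSCO's exact shipped configuration:

# Enable HTCondor's shared_port daemon so all daemons on this host
# accept inbound connections through a single port
USE_SHARED_PORT = True
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
# Listen on the port BOSCO uses (11000)
SHARED_PORT_ARGS = -p 11000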

Why should I care?

Limiting BOSCO to 1 incoming port is very useful for users on systems they don't manage.  The node needs only 1 open port, 11000, in order to run BOSCO.  If the system has a firewall, you only have to request that port 11000 be opened, rather than a huge swath.  And if you manage the system yourself, you will be happy that only 1 port needs to be opened in order to allow BOSCO submissions.

Administrators will like this feature because it is in line with other applications they may run.  For example, httpd requires only 1 port, 80.  Now BOSCO is in the same realm, requiring only 1 port, 11000.