Sunday, January 25, 2015

HTCondor CacheD: Caching for HTC - Part 2

In the previous post, I discussed why we decided to make the HTCondor CacheD.  This time, we will discuss the operation and design of the CacheD, as well as show an example utilizing a BLAST database.

It is important to note that the CacheD is still very much "dissertation-ware."  It functions enough to demonstrate the improvements, but not enough to be put into production.

The Cache

The fundamental unit that the CacheD works with is an immutable set of files: a cache.  A user creates a cache and uploads files into it.  Once the upload is complete, the cache is committed and may never be altered.  Each cache also has a set of metadata associated with it, stored as ClassAds in a durable database (using the same techniques as the SchedD job queue).

The cache has a 'lease': an expiration date.  The lease is the amount of time that the cache is guaranteed to be available from a particular CacheD.  When creating a cache, the user requests a lifetime, which the CacheD can either accept or reject.  Once the cache's lifetime expires, the CacheD may delete it, and it is no longer guaranteed to be available.  After a cache has been committed, the user may request a lease extension, which the CacheD may or may not accept.
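As a rough illustration, a CacheD's lease-acceptance policy could look like the following Python sketch.  The function names and the one-week policy are hypothetical, not the CacheD's actual API; the real daemon stores this state as ClassAds.

```python
import time

# Hypothetical local policy: accept leases up to one week.
MAX_LEASE_SECONDS = 7 * 24 * 3600

def accept_lease(requested_seconds, now=None):
    """Accept or reject a requested cache lifetime against local policy.
    Returns the expiration timestamp, or None if rejected."""
    if requested_seconds > MAX_LEASE_SECONDS:
        return None  # rejected; the user may retry with a shorter lease
    now = time.time() if now is None else now
    return now + requested_seconds

def lease_expired(expiration, now=None):
    """Once the lease expires, the CacheD is free to delete the cache."""
    now = time.time() if now is None else now
    return now >= expiration
```

A lease-extension request would simply be a second call to accept_lease against the same policy.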

The cache also has properties similar to a job.  For example, the cache can have its own set of requirements for which nodes it can be replicated to.  By default, a cache is initialized with the requirement that a CacheD have enough disk space to hold it.  Analogous to HTCondor's matchmaking for jobs, both the cache and the CacheD can express requirements.  This two-way matching guarantees that any local policies are enforced.

The requirements attribute is especially useful when the user aligns the cache's requirements with those of the jobs that will use the data.  For example, if the user knows that their processing requires nodes with 8 GB of RAM, then there is no point in replicating the cache to a node with less than 8 GB of RAM.
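A minimal sketch of this two-way matching, using plain Python dictionaries in place of real ClassAds (the attribute names are illustrative, not actual ClassAd attribute names):

```python
def matches(cache_ad, cached_ad):
    """Two-way match: the cache's requirements must hold against the
    CacheD's attributes, and the CacheD's requirements against the cache's."""
    cache_side = cached_ad["Memory"] >= cache_ad["RequestMemory"]  # cache's requirement
    cached_side = cached_ad["Disk"] >= cache_ad["DiskUsage"]       # CacheD's requirement
    return cache_side and cached_side

# A ~15 GB cache whose jobs need nodes with at least 8 GB of RAM
# (sizes in MB).
cache = {"DiskUsage": 15_000, "RequestMemory": 8_192}
small_node = {"Disk": 100_000, "Memory": 4_096}
big_node = {"Disk": 100_000, "Memory": 16_384}

print(matches(cache, small_node))  # False: too little RAM for the jobs
print(matches(cache, big_node))    # True
```

Replicating only to nodes that can actually run the jobs avoids wasting disk and transfer bandwidth on nodes the jobs will never match.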

The CacheD

The CacheD is the daemon that manages caches on the local node.  Each CacheD is a peer of every other CacheD; there is no central coordination daemon.  Each CacheD serves several functions:
  1. Respond to user requests to create, query, update, and delete caches.
  2. Send replication requests to CacheD's that match each cache's requirements.
  3. Respond to replication requests from other CacheD's.  Matching is done on the cache before transferring the data.
The CacheD keeps a database storing the metadata for each cache, using the same durable-storage techniques as the SchedD uses for its job queue.  It also maintains a directory containing all of the caches stored on the node.

The CacheD's user interface is primarily through Python bindings, at least for the time being.

CacheD Usage

The CacheD is used in conjunction with glideins.  The CacheD is started along with other glidein daemons such as the HTCondor StartD.
Initialization of cache as well as initial replication requests
The user initializes the cache by creating it and uploading it to the CacheD on the user's submit machine.  That CacheD then connects to remote CacheD's, sending replication requests.

The BitTorrent communication between nodes after accepting the replication request
Once the CacheD's accept the replication request(s), the BitTorrent protocol allows communication among all nodes inside the cluster, as well as with the user's submit machine.  This graph only shows a single cluster, but the same pattern extends to many clusters as well.

In Action 

Partial graph showing data transfers.  Due to overflowing the event queue, not all downloads are captured.
The above graph shows the data transfers using the BitTorrent protocol between the nodes that accepted the replication request and the Cache Origin, which is an external node.  In this example, only 5 remote CacheD's were started on the cluster.  Because of all of the traffic between nodes, a graph at this level of detail becomes unreadable very quickly as the number of remote CacheD's increases.

You will notice that the Cache Origin only transfers to 2 nodes inside the cluster.  The BitTorrent protocol is complicated and difficult to predict, so this could be caused by many factors.  For example, the two nodes may simply have found the Cache Origin first, making them the first to download from it.  The other nodes would then have found the internal cluster nodes that already held portions of the cache, and begun downloading from them.

It is important to note that even though the ~15 GB cache is transferred to all 5 nodes, totaling ~75 GB of transferred data, only ~15 GB is transferred from the cache origin; all of the remaining transfers are between nodes inside the cluster.
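Back-of-the-envelope, the split between origin traffic and intra-cluster traffic scales like this (a simple model, not measured data):

```python
def transfer_split(cache_gb, num_nodes):
    """With BitTorrent, the origin sends roughly one full copy of the
    cache; the peers exchange everything else among themselves."""
    total = cache_gb * num_nodes       # data every node must end up with
    from_origin = cache_gb             # ~one full copy leaves the origin
    intra_cluster = total - from_origin
    return from_origin, intra_cluster

origin, internal = transfer_split(15, 5)
print(origin, internal)  # 15 60
```

As the number of nodes grows, the origin's share stays roughly constant while the intra-cluster share grows linearly, which is exactly why the approach bypasses the cluster-boundary bottleneck.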

Up Next

In Part 3 of the series, I will look at transfer timings using data from trial runs.  As a hint: for one-to-one transfers, the BitTorrent protocol is slower than a direct transfer, but it really shines as the number of downloaders and seeders increases.

Thursday, January 22, 2015

Condor CacheD: Caching for HTC - Part 1

A typical job flow in the Open Science Grid is as follows:
  1. Stage input files to worker node.
  2. Start processing...
  3. Stage output files back to submit host.
In an ideal world, step 2 would take the longest.  But as data sizes increase, so too does the time to stage files in and out (primarily stage-in).  Meanwhile, we continue to recommend the same maximum job length to users, around 8 hours.

Let's use a real-world example.  The nr BLAST database (Non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding those in env_nr) (source) is currently ~15GB.  When running BLAST, each query needs to scan the entire nr database, so the database is required on every worker node that will run a query.

It is easy to say: a 1 Gbps connection can transfer 15GB in ~120 seconds.  Two minutes doesn't seem unreasonable to stage in data, especially if the job can run for 8 hours.  But you usually have many queries, so you will want to run them across many jobs.  Say you submit 10 jobs, each needing the 15GB.  If they all transfer at the same time, it will take 20 minutes of stage-in before any processing begins.  And that is only 10 jobs; what if you are submitting 1,000 jobs, or 10,000?  At 1,000 jobs, it takes roughly 33 hours just to transfer the input files.  Suddenly 2 minutes to transfer a database becomes hours, and your submit machine is doing nothing but transferring input files!  Also, with some math (or a simple simulation), you can see that if the jobs start sequentially, you are limited in the number of jobs that can run simultaneously.
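The arithmetic above can be checked with a few lines of Python (a naive model that ignores protocol overhead and link contention):

```python
def stagein_hours(size_gb, link_gbps, num_jobs):
    """Total hours for one submit host to ship `size_gb` to each of
    `num_jobs` workers over a shared `link_gbps` uplink."""
    seconds_per_job = size_gb * 8 / link_gbps  # gigabytes -> gigabits
    return num_jobs * seconds_per_job / 3600

print(round(stagein_hours(15, 1, 1) * 3600))  # 120 seconds for one job
print(round(stagein_hours(15, 1, 1000), 1))   # 33.3 hours for 1,000 jobs
```

The model also makes the brute-force fix obvious: a 10 Gbps link divides every number by 10, but the scaling with job count is unchanged.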

This increasing data size and static maximum job length have led to compromises and innovations on the part of users and sites.


Some users and sites have attempted to solve the problem of larger input files.  Their approaches typically fall into two categories:
  1. Bandwidth increases from the storage nodes.
  2. Caching near the execution host.

Lots O' Bandwidth

OSG Connect's Stash attempts to ease the stage-in problem by providing a storage service with lots of bandwidth (10 Gbps) and support for site caching through HTTP.  Higher bandwidth is certainly the brute-force method of decreasing file transfer times.  But 10 Gbps only knocks a factor of 10 off all of the above times, and only if you can actually use all 10 Gbps.

Numerous sites and users have tried to use high-bandwidth storage services to solve the stage-in problem.  Nebraska (and many other sites) even has its storage services connected to 100 Gbps network connections.  But transfers tend to be limited not by the bandwidth available to the storage device, but by the bandwidth bottleneck at the boundary of each cluster to the outside world, usually a NAT.


With Stash's HTTP interface, users can take advantage of local site caching.  Site caching was designed for detector calibration data: small files that are frequently accessed, the perfect use of an HTTP cache.  But what about when you want to push 15 GB of files through the HTTP cache?  A typical site may only have a few cache servers, so the available download bandwidth is limited not by the hosting server, but by the cache servers themselves.  I don't know of many sites that put 10 Gbps connections on their caching servers.

Additionally, site caching simply runs from the submit host to the remote caching host.  In the last week, 95% of the OSG VO's CPU hours were provided by ~25 unique sites (source, calculations).  This is analogous to increasing the bandwidth of the submit host 25 times (assuming standard 1 Gbps connections), which is a very cheap way to increase the bandwidth available for input files.  But the VO's usage is not split evenly among those 25 sites: 6 sites account for ~50% of the OSG VO usage.  On those 6 clusters, the transfer bandwidth is limited to what those 6 proxy servers can push to their cluster's nodes.  Therefore, for 50% of your processing slots, you are only increasing the transfer speed by a factor of 6.

The HTTP protocol and HTTP caching are not designed for such large files.  HTTP will always be optimized for its dominant users: browsers downloading relatively small web pages.  Software built around HTTP is likewise tuned for lots of small files, so it may not be ideal for large ones.

A New Hope

Part of my PhD has been to develop a new way to handle large stage-in datasets.  In the problem statement above, two areas were most often used to optimize transfer times: bandwidth and caching.  My work attempts to optimize both of these approaches using a daemon named the HTCondor CacheD.


As noted above, HTTP is great for its designed use: websites with lots of small files.  But it is not designed or well optimized for larger files.  Further, the bandwidth from the execution host to the storage servers is typically bottlenecked at the cluster boundary.  Therefore another protocol was chosen that can better handle large files: BitTorrent.

BitTorrent was chosen because it has many characteristics that make it ideal for transferring large files.  To bypass the network bottlenecks at remote clusters, BitTorrent allows clients to transfer data among themselves while also downloading from the original source.  This lets every worker node become a cache for the rest of the cluster.  As we will see in the next post, BitTorrent works well because the vast majority of the traffic stays between nodes inside the cluster, rather than going to nodes outside it.


The caching described above uses a single cache per site.  From the usage breakdown, you can see that VO's that rely on a few sites will find this a bottleneck.  For campus users, who may only use 1 or 2 clusters, this can be as bad a bottleneck as not using a cache at all.  It also requires the site to have a caching server set up, which requires administrator cooperation.

The CacheD instead uses local caches on each worker node.  This local cache allows very fast transfers when the jobs begin.  Further, the local cache acts as a seeder for the BitTorrent transfers described above.

When a job begins running on a node, the stage-in step requests a copy of the data files from the local cache.  If the files are not already staged on the node, the local cache pulls them using BitTorrent from the submit host (the cache origin) and from all other nodes in the cluster.  Once the local cache holds the stage-in files, it transfers them into the job's sandbox and processing begins.  Subsequent jobs that require the same stage-in files request and immediately receive them, since they are already cached locally.
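A sketch of that stage-in flow in Python.  The function names here are hypothetical, not the CacheD's actual bindings, and `download` stands in for the BitTorrent pull from the cache origin and peer nodes:

```python
import os
import shutil

def stage_in(cache_name, cache_dir, sandbox, download):
    """Serve a job's input files from the node-local cache, filling
    the cache on first use via the supplied `download` callable."""
    local_copy = os.path.join(cache_dir, cache_name)
    if not os.path.exists(local_copy):     # cache miss: first job on this node
        download(cache_name, local_copy)   # BitTorrent pull (origin + peers)
    # Cache hit (or freshly filled): copy straight into the job sandbox.
    shutil.copytree(local_copy, sandbox)
```

The second and later jobs on the node skip the download entirely, which is where the startup-time improvement comes from.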

Up Next

In the next post, I will describe the design of the CacheD in more detail.  Further, I will show the CacheD used with BLAST databases, and the improvement in job startup time that results from optimized stage-in transfers.

UPDATE: Link to Part 2, Architecture of the CacheD.

Wednesday, June 25, 2014

GPUs on the OSG

For a while, we have heard of the need for GPUs on the Open Science Grid.  Luckily, HCC hosts numerous GPU resources that we are willing to share.  And with today's release of HTCondor 8.2, we wanted to integrate these GPU resources transparently into the OSG.


Submission to the GPU resources at HCC uses the HTCondor-CE.  The submission file is shown below (gist):

You may notice that it uses Request_GPUs to specify the number of GPUs the job requires.  This is the same command used when running with native (non-grid) HTCondor.  You may submit with the line Request_GPUs = X (up to 3) to our HTCondor-CE node, as each GPU node has exactly 3 GPUs.
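For readers who cannot see the gist, a minimal sketch of such a grid-universe submit file follows.  The executable name and CE hostname are placeholders, not the actual values from the gist:

```
universe       = grid
grid_resource  = condor ce.example.edu ce.example.edu:9619

executable     = gpu_job.sh
request_GPUs   = 1          # up to 3; each GPU node has exactly 3 GPUs

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
use_x509userproxy       = true

output = job.out
error  = job.err
log    = job.log

queue
```

The only GPU-specific line is request_GPUs; everything else is an ordinary grid-universe submission.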

Additionally, the OSG Glidein Factories have a new entry point, CMS_T3_US_Omaha_tusker_gpu, which is available for VO Frontends to submit GPU jobs.  Email the glidein factory operators to enable GPU resources for your VO.

Job Structure

The CUDA libraries are loaded automatically into the job's environment.  Specifically, we are running CUDA libraries version 6.0.

We tested submitting a binary compiled with CUDA 5.0 to Tusker.  It required a wrapper script in order to configure the environment, and to transfer the CUDA library with the job.  Details are on the gist along with the example files.

Lessons Learned

Things to note:
  • If a job matches more than 1 HTCondor-CE route, then the router will round robin between the routes.  Therefore, it is necessary to modify all routes if you wish specific jobs to go to a specific route.
  • Grid jobs do not source /etc/profile.d/ on the worker node.  I had to manually source those files in the local submit attributes file in order to use the module command and load the CUDA environment.


Relevant configuration files:
  • HTCondor-CE Job Router config in order to route GPU jobs appropriately.
  • HTCondor-CE PBS local submit file attributes file that includes the source and module commands.

Wednesday, April 9, 2014

Part 1: Features of the Chrome App - Embedding

In my previous post, I introduced my OSG Chrome App.  In this post, I want to introduce one of its features: embedding profiles.

Part of my morning routine is checking multiple websites for the status of the resources at Nebraska.  For example, I check our dashboard page that I've written about before.  And I check the HCC GlideinWMS status page.

The HCC GlideinWMS status page shows only monitoring information.  To see any accounting data, I have to navigate to the GratiaWeb site and filter for HCC usage.  Instead, I want to view the accounting data on the same page where I view the GlideinWMS status.  So I added the ability to embed a set of graphs using the Chrome App.

Creating a HCC GlideinWMS Profile

First, I need to create a GlideinWMS profile for HCC.  I have 2 questions I want answered: who is running, and where are they running?  In this case, I am playing the role of a VO Manager: I want to see usage filtered by VO, specifically my VO, HCC.  The VO Manager role gives me 2 graphs by default: total usage per VO, and where the VO is running.  But it doesn't show which users are running in my VO.  Therefore, I want to add a new graph, the Glideins per user graph (since HCC uses GlideinWMS), which is on GratiaWeb.

To add a graph, I followed the documentation to add the Glidein per user graph.  I copied and pasted the graph URL into the box, and it added the graph to my list.
HCC Usage with new graph.  Not much usage...

Embedding the Profile

Now that I have the profile showing the data I want to see, I want to embed this data in my webpage.  I followed the documentation again, for sharing the graphs and embedding it.

I clicked the green share button at the top, and entered the name and description of the profile.  Then I submitted the profile for sharing.  It returned an embed link that I can use on the webpage.

Share profile showing embed URL

Once I have this embed URL, I can copy / paste that HTML into the HCC GlideinWMS page and it will show HCC data every time I visit.

HCC Usage embedded on our own website

And that's it.


Friday, April 4, 2014

OSG Usage Chrome App

Hi, I'm Derek Weitzel.  You may remember me from such side projects as:
On today's episode, I will introduce a Chrome Application designed for the OSG's accounting system.

OSG Usage Chrome Application

The OSG Usage Viewer is a packaged Chrome app designed to display accounting graphs from Gratia, the OSG's accounting system.  The app allows you to add graphs from the OSG's Gratia web interface, manipulate the data filters, and share them with others.  Full documentation of the app is available.  Download the app today!


Wednesday, March 12, 2014

A HCC Dashboard with OSG Accounting

After the 2013 SuperComputing Conference, we found ourselves with an extra monitor at HCC.  Therefore, I set about creating a dashboard that shows the current status of HCC.

Creating the Dashboard

I have an interest in data visualization, and follow many blogs that show off new methods.  On one occasion, I saw Dashing mentioned.

Dashing is a dashboard framework made by Shopify for their own use and released as open source.  It is mostly written in Ruby and CoffeeScript (a higher-level JavaScript, if you can imagine).  It has a concept of jobs, which fetch data and forward it to the framework.  The data is sent to clients viewing the dashboard, where it is parsed by the CoffeeScript and modeled with a combination of data bindings from batman.js, CSS with SCSS, and plain old HTML.

I wrote several jobs to retrieve data from numerous sources.  Most of the information comes from HCC's local instance of OSG's Gratia accounting system.  The HCC Dashboard uses our Gratia system for:
  • How many CPU hours were consumed on our resources.
  • Current usage by user.
The job to retrieve the top users also communicates with HCC's user database to retrieve college and department information.  The storage meters use an external probe on the clusters to periodically report the used storage space of our filesystems.

Each box is an instance of a widget.  A widget is a combination of HTML, SCSS, and CoffeeScript used to parse and present the data.

Current Dashboard Design

Most of the information on the dashboard is in the form of monitoring.  The current number of cores used on our resources and the top users widgets use Gratia monitoring information.  The networking graph uses Ganglia.

We also include a "Hourly Price on Amazon EC2" widget.  This combines the computing, storage, and networking costs (extrapolated from current values), and displays an expected price per hour on Amazon.  The computing is easily the most expensive component.

Who Uses it?

HCC uses it to display the current status of our computing center.  It is useful for spotting when anything is working incorrectly.  For example, we were able to spot problems on one of our clusters when the number of running cores decreased significantly, which was caused by the scheduler draining off a significant portion of the cluster in order for a single user to run a toy job.

The top-users widget is also interesting to HCC researchers when they come into the offices.  They are able to see their own usernames prominently displayed on the big screen.

Growing collection of visualizations

More Information

The source for the dashboard is available on Github.  Also, the live instance of the dashboard is available here.

Thursday, February 27, 2014

Moving from a Globus to an HTCondor Compute Element

A few weeks ago, we moved our opportunistic clusters, Crane and Tusker, from Globus GRAM gatekeepers to the new HTCondor-CE.  We moved to the HTCondor-CE in order to solve performance issues we experienced with the GRAM when using the Slurm scheduler.

When we switched Tusker from PBS to Slurm, we knew that we would have issues with the grid software.  With PBS, Globus would use the scheduler event generator to efficiently watch for state changes in jobs, i.e., idle -> running, running -> completed.  But Globus does not have a scheduler event generator for Slurm, so it must query each job every few seconds to retrieve its current status.  This caused a tremendous load on the scheduler, and on the machine.

Load graph on the gatekeeper
We switched to the HTCondor-CE in order to alleviate some of this querying load.  The HTCondor-CE provides configuration options to change how often it queries for job status, and can provide system wide throttles for job status querying.

The HTCondor-CE also provides much better transparency to aid in administration.  For example, there is no single command in Globus to view the status of the jobs.  In the HTCondor-CE, there is: condor_ce_q.  This command tells you exactly which jobs the CE is monitoring, and what it believes their status is.  Or, if you want to know which jobs are currently transferring input files, they will have the < or > symbol, for incoming or outgoing transfers respectively, in their job state column.

The HTCondor-CE uses the same authentication and authorization methods as Globus.  You still need a certificate, and you still need to be part of a VO.  The job submission file looks a little different: instead of gt5 as your grid resource, it is condor.
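A sketch of the changed line, using a placeholder hostname (this is illustrative, not the actual submit file from the post):

```
# Globus GRAM (old):
#   grid_resource = gt5 ce.example.edu/jobmanager-pbs
# HTCondor-CE (new):
grid_resource = condor ce.example.edu ce.example.edu:9619
```

Everything else in the submit file stays as an ordinary grid-universe submission.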

Improvements for the future

The HTCondor-CE could be improved.  For example, each real job has 2 entries in the condor_ce_q output.  This is due to the job being routed from the incoming job to the scheduler-specific job.  The condor_ce_q command could be improved to show the link between the 2 jobs, similar to the DAG output of the condor_q command.

The job submission file is removed after a successful or unsuccessful submission to the local batch system (Slurm).  This can make debugging very difficult if the job submission fails for any reason.  Further, the gatekeeper doesn't propagate stdout / stderr of the submission command into the logs.

Final Thoughts

The initial impressions of the HTCondor-CE have been very good.  Since installing the new CE, we have had ~100,000 production jobs run through the gatekeeper from many different users.

And now for the obligatory accounting graphs:

Usage of Tusker as reported by GlideinWMS probes.

Wall Hours by VO on Tusker since the transition to the HTCondor-CE