Thursday, April 28, 2016

Moving to a new Blogging Website

I wanted to write a quick note.  I am moving to a new blogging platform at https://djw8605.github.io/.  Please update any RSS or Atom feeds you may have.

Wednesday, April 20, 2016

Querying an Elasticsearch Cluster for Gratia Records

For the last few days I have been working on email reports for GRACC, OSG's new prototype accounting system.  The source for the email reports is located on GitHub.

I have learned a significant amount about queries and aggregations for Elasticsearch.  For example, below is the query that counts the number of records for a date range.
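
A minimal sketch of such a count query follows, written as a raw request body in Python rather than with elasticsearch-dsl.  The timestamp field name EndTime and the count_query helper are assumptions for illustration, not taken from the GRACC source.

```python
import json
from datetime import datetime, timedelta


def count_query(start, end, field="EndTime"):
    """Build a request body matching records with start <= field < end.

    The field name is an assumption; use whichever timestamp field
    the records actually carry.
    """
    return {
        "query": {
            "range": {
                field: {
                    "gte": start.isoformat(),
                    "lt": end.isoformat(),
                }
            }
        }
    }


# Count records for the 24 hours ending at midnight on April 20, 2016.
end = datetime(2016, 4, 20)
start = end - timedelta(days=1)
count_body = count_query(start, end)
print(json.dumps(count_body, indent=2))
```

A body like this would typically be sent to the index's _count endpoint, which returns only a number.  With elasticsearch-dsl, the same range clause can be attached with Search.filter and counted with Search.count.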

The above query searches for records in the specified date range and counts them.  It uses the elasticsearch-dsl Python library.  It does not return the actual records, just a number.  This is useful for generating raw counts and a delta for records processed over the last few days.

The other query I designed aggregates the number of records per probe.  This query is designed to help us understand differences in specific probes' reporting behavior.

This query is much more complicated than the simple count query above.  First, it creates a search selecting the "gracc-osg-*" indexes.  It also creates an aggregation "A" which will be used later to aggregate by the ProbeName field.

Next, we create a bucket called day_range which is of type range.  It aggregates in two ranges, the last 24 hours and the 24 hours previous to that.  Next, we attach our ProbeName aggregation "A" defined above.  In return, for each of the two ranges we get one bucket per probe, holding the number of records that exist for that probe.
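
The shape of this nested aggregation can be sketched as a raw request body.  This is an illustration under assumptions, not the original query: the timestamp field EndTime, the inner aggregation name "probes", and the helper names are hypothetical, and the range bounds are given as epoch milliseconds.  Whether range or date_range is the better fit depends on the field mapping and the Elasticsearch version.

```python
from datetime import datetime, timedelta, timezone


def to_millis(dt):
    """Convert an aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


def probe_counts_by_day(now, time_field="EndTime"):
    """Build a body with a two-range bucket and a per-probe terms sub-aggregation."""
    day = timedelta(days=1)
    return {
        "size": 0,  # only the aggregation results are wanted, not the records
        "aggs": {
            "day_range": {
                "range": {
                    "field": time_field,
                    "ranges": [
                        # the last 24 hours
                        {"from": to_millis(now - day), "to": to_millis(now)},
                        # the 24 hours before that
                        {"from": to_millis(now - 2 * day), "to": to_millis(now - day)},
                    ],
                },
                # nested aggregation: one bucket per ProbeName inside each range
                "aggs": {"probes": {"terms": {"field": "ProbeName"}}},
            }
        },
    }


now = datetime(2016, 4, 20, tzinfo=timezone.utc)
agg_body = probe_counts_by_day(now)
```

The response to a body of this shape contains one day_range bucket per range, each with its own list of ProbeName buckets and document counts.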

This nested aggregation is a powerful feature that will be used in the summarization of the records.

Tuesday, March 22, 2016

Fedora copr: Slurm per job tmp directories!

A Slurm site on the OSG was having problems with /tmp filling up occasionally.  Our own clusters have a Slurm Spank plugin that creates a per-job tmp directory.  This per-job directory is useful for grid jobs which may be preempted without time to clean up after themselves.  The motivation for this plugin came from HTCondor, where each job has its own job directory.

In order to share this plugin, I started a few GitHub repos (slurm-tmpdir and slurm-plugins-lua) for the plugin and dependencies.  Next, I used Fedora's Copr build system to build a repo that I could share with others, the Slurm-Plugins repo.
Copr Slurm-Plugins Project

I'm sure that I don't have an exhaustive list of plugins in my repo.  Actually, I only have one.  But I can always add more by simply adding a GitHub repo.

Copr was very easy to use to build my packages, and it works for all of the OSes that I care about.  I'm very happy with how Copr worked for sharing Nebraska's Slurm-Plugins.

Wednesday, October 14, 2015

The StashCache Tester

StashCache is a framework to distribute user data from an origin site to a job.  It uses a layer of caching, as well as the high performance XRootD service, in order to distribute the data.  It can source data from multiple machines, such as the OSG Stash.
StashCache Architecture (credit: Brian Bockelman AHM2015 Talk)
In order to visualize the status of StashCache, I developed the StashCache-Tester.  The tester runs every night and collects data by submitting test jobs to multiple sites.
The website visualizes the data received from these tests.  It shows three visualizations:
  1. In the top left is a table of the average download speed across multiple test jobs for each site.  They are colored with green being the best, and red the worst.  Also, if the test has not been able to run at a particular site for several days, it will show the last successful test, but fade the color accordingly as the test gets older.  An all-white background means that the test hasn't been conducted for three or more days.
  2. In the top right is a bar graph comparing the average transfer rates for multiple sites.  This makes differences between sites easier to see than in the table.
  3. On the bottom, we have a historical graph showing the last month of recorded data.  You can see that some sites have large peaks of download speeds.  Additionally, some sites are very infrequently tested, such as Nebraska (which is the CMS Tier-2).  Infrequent testing can be caused by an overloaded site that is unable to run the test jobs.
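
The color fading described for the table can be sketched as a blend toward white based on the age of the last successful test.  The linear blend, the function name, and the example green are assumptions for illustration; only the three-day all-white cutoff comes from the description above.

```python
def blend_toward_white(rgb, age_days, max_age_days=3.0):
    """Fade an (r, g, b) color toward white as the last test result ages.

    A result that is max_age_days old or older comes out fully white.
    The linear fade is an assumption; the tester may use a different curve.
    """
    fraction = min(max(age_days / max_age_days, 0.0), 1.0)
    return tuple(round(c + (255 - c) * fraction) for c in rgb)


green = (0, 160, 0)
fresh = blend_toward_white(green, 0)   # tested today: color unchanged
stale = blend_toward_white(green, 3)   # three days old: all white
older = blend_toward_white(green, 10)  # older results stay white
```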
In the future, I want to add graphs comparing the performance of individual caches in addition to the existing site comparisons.  Further, I would like to add many more sites to be tested.

Friday, August 7, 2015

GPUs and adding new resources types to the HTCondor-CE

In the past, it has been difficult to add new resource types to the OSG CE, whether it was a Globus GRAM CE or an HTCondor-CE.  But it has just gotten a little bit easier.  Today I added HCC's GPU resources to the OSG with this new method.

With an (as yet unapproved) pull request, the HTCondor-CE is able to add new resource types by modifying only two files: the routes table and the scheduler attributes customization script.  Previously, it required editing a third Python script which had very tricky syntax (Python, which spit out ClassAds...).  In the following examples, I will demonstrate how to use this new feature with GPUs.

The Routes

Each job submitted to an HTCondor-CE must follow a route from the original job to the final job submitted to the local batch system.  The HTCondor JobRouter is in charge of translating the original job to the final job, according to rules specified in the router configuration.  Crane's GPU route is:

The route submits the job to the grid_gpu partition of the local PBS (actually Slurm) scheduler.  Further, it adds a special new attribute:
default_remote_cerequirements = "RequestGpus == 1"
This attribute is used in the next section, the local submit attributes script.


Local Submit Attributes Script

The local submit attributes script translates the remote_cerequirements to the actual scheduler language used at the site.  For Crane's GPU configuration, the snippet added for GPUs is:

This snippet checks for the existence of the RequestGpus attribute from the environment, and if detected, will insert several lines into the submit script.  It will first add the SLURM line to request a GPU, then it will source the module setup script and load the cuda module.
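
The logic of that check can be mirrored in a few lines.  The sketch below is written in Python purely to illustrate the flow and is not the snippet used on Crane; the exact Slurm directive and the module setup path are assumptions.

```python
def gpu_submit_lines(env):
    """Return extra submit-script lines when RequestGpus is present in env.

    env is a mapping such as os.environ.  The lines returned are
    illustrative: the directive and the modules path are assumptions.
    """
    if "RequestGpus" not in env:
        return []
    return [
        "#SBATCH --gres=gpu:1",              # ask Slurm for one GPU
        "source /etc/profile.d/modules.sh",  # hypothetical module setup script
        "module load cuda",                  # make the CUDA toolkit available
    ]


for line in gpu_submit_lines({"RequestGpus": "1"}):
    print(line)
```

Jobs that arrive without RequestGpus get no extra lines, so non-GPU routes are unaffected.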

Next Steps

The next step for using GPUs on the OSG is to use one of the many frontends that are capable of submitting glideins to the GPU resources at HCC.  Currently, the HCC, OSG, OSGConnect, and GLOW frontends are capable of submitting to the GPU resources.

Tuesday, July 28, 2015

The more things change, the more they stay the same

A lot has happened since I last posted in January.

  1. I have successfully defended and submitted my dissertation: Enabling Distributed Scientific Computing on the Campus.  I will formally graduate on August 15.
  2. I have been offered, and accepted, a position with the University of Nebraska - Lincoln Holland Computing Center.  I will be working with the Open Science Grid's software & investigations team.
  3. On a personal note, I am now engaged.
  4. And I am moving later this year to the HCC Omaha office.
Now that I have graduated, I hope to write more blog posts about what I am doing, as well as what is happening in the OSG teams that I am working with.

Sunday, January 25, 2015

HTCondor CacheD: Caching for HTC - Part 2

In the previous post, I discussed why we decided to make the HTCondor CacheD.  This time, we will discuss the operation and design of the CacheD, as well as show an example utilizing a BLAST database.

It is important to note that the CacheD is still very much "dissertation-ware."  It functions enough to demonstrate the improvements, but not enough to be put into production.

The Cache

The fundamental unit that the CacheD works with is an immutable set of files in a cache.  A user creates and uploads files into the cache.  Once the upload is complete, the cache is committed and may no longer be altered.  The cache has a set of metadata associated with it as well, stored as ClassAds in a durable storage database (using the same techniques as the SchedD job queue).

The cache has a 'lease', or an expiration date.  This lease is a given amount of time that the cache is guaranteed to be available from a particular CacheD.  When creating the cache, the user provides a requested cache lifetime.  The CacheD can either accept or reject the requested cache lifetime.  Once the cache's lifetime expires, it can be deleted by the CacheD and is no longer guaranteed to be available.  The user may request to extend the lifetime of a cache after it has already been committed, which the CacheD may or may not accept.
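
The lease negotiation can be sketched as follows.  This is a toy model, not the CacheD's code: the class names and the fixed maximum-lifetime policy are assumptions chosen only to show a request being accepted, rejected, and extended.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CacheLease:
    name: str
    expires: datetime


class LeasePolicy:
    """Hypothetical CacheD-side policy: accept lifetimes up to a fixed maximum."""

    def __init__(self, max_lifetime):
        self.max_lifetime = max_lifetime

    def request(self, name, now, lifetime):
        """Return a lease if the requested lifetime is acceptable, else None."""
        if lifetime > self.max_lifetime:
            return None
        return CacheLease(name, now + lifetime)

    def extend(self, lease, now, extra):
        """Try to push the expiration out; the policy may refuse."""
        new_expiry = lease.expires + extra
        if new_expiry - now > self.max_lifetime:
            return False
        lease.expires = new_expiry
        return True


def is_expired(lease, now):
    """After expiry the cache may be deleted and is no longer guaranteed."""
    return now >= lease.expires


policy = LeasePolicy(max_lifetime=timedelta(days=7))
now = datetime(2015, 1, 25)
lease = policy.request("blast_db", now, timedelta(days=3))
rejected = policy.request("blast_db", now, timedelta(days=30))
```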

The cache also has properties similar to a job.  For example, the cache can have its own set of requirements for which nodes it can be replicated to.  By default, a cache is initialized with the requirement that a CacheD has enough disk space to hold the cache.  Analogous to the HTCondor matching with jobs, the cache can have requirements, and the CacheD can have requirements.  A CacheD's requirement may be that the node has enough disk space to hold the matched cache.  This two-way matching guarantees that any local policies are enforced.

The requirements attribute is especially useful when the user aligns the cache's requirements with the jobs that require the data.  For example, if the user knows that their processing requires nodes with 8GB of RAM available, then there is no point in replicating the cache to a node with less than 8GB of RAM.
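
Two-way matching can be illustrated with plain Python stand-ins.  HTCondor evaluates ClassAd expressions for this; the dictionaries, attribute names, and lambdas below are hypothetical and only show that both sides must agree before a cache is replicated.

```python
def matches(cache_ad, cached_ad):
    """A replication is allowed only if each side's requirements accept the other."""
    return cache_ad["requirements"](cached_ad) and cached_ad["requirements"](cache_ad)


# The cache wants enough disk to hold it, plus 8GB of RAM to match its jobs.
cache_ad = {
    "name": "blast_db",
    "size_gb": 15,
    "requirements": lambda node: node["free_disk_gb"] >= 15 and node["memory_gb"] >= 8,
}


def make_node(free_disk_gb, memory_gb):
    """Build a CacheD-side ad whose local policy is: the cache must fit on disk."""
    return {
        "free_disk_gb": free_disk_gb,
        "memory_gb": memory_gb,
        "requirements": lambda cache: cache["size_gb"] <= free_disk_gb,
    }


big_node = make_node(free_disk_gb=100, memory_gb=16)
low_memory_node = make_node(free_disk_gb=100, memory_gb=4)
full_node = make_node(free_disk_gb=10, memory_gb=16)

replicate_to = [
    label
    for label, node in [("big", big_node), ("low_memory", low_memory_node), ("full", full_node)]
    if matches(cache_ad, node)
]
```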

The CacheD

The CacheD is the daemon that manages caches on the local node.  Each CacheD is considered a peer to all other CacheD's; there is no further coordination daemon.  Each CacheD serves multiple functions:
  1. Respond to user requests to create, query, update, and delete caches.
  2. Send replication requests to CacheD's that match each cache's requirements.
  3. Respond to replication requests from other CacheD's.  Matching is done on the cache before transferring the data.
The CacheD keeps a database storing the metadata for each cache.  The database is stored using the same techniques as the SchedD uses for jobs to maintain a durable database store.  It also maintains a directory containing all of the caches stored on the node.

The CacheD's user interface is primarily through python bindings, at least for the time being.

CacheD Usage

The CacheD is used in conjunction with glideins.  The CacheD is started along with other glidein daemons such as the HTCondor StartD.
Initialization of cache as well as initial replication requests
The user initializes the cache by creating and uploading it to the user's submit machine.  The CacheD connects to remote CacheD's, sending replication requests.

The BitTorrent communication between nodes after accepting the replication request
Once the CacheD's accept the replication request(s), the BitTorrent protocol allows for communication between all nodes inside the cluster, as well as with the user's submit machine.  This graph only shows a single cluster, but this could be replicated to many clusters as well.

In Action 

Partial graph showing data transfers.  Due to overflowing the event queue, not all downloads are captured.
The above graph shows the data transfer using the BitTorrent protocol between the nodes that have accepted the replication request and the Cache Origin, which is an external node.  In this example, only 5 remote CacheD's were started on the cluster.  Because of all of the traffic between nodes, a graph at this level of detail becomes unreadable very quickly when increasing the number of remote CacheD's.

You will notice that the Cache Origin only transfers to 2 nodes inside the cluster.  The BitTorrent protocol is complicated and difficult to predict, therefore this could be caused by many factors.  For example, the two nodes could have found the Cache Origin first, therefore being the first nodes to download from it.  The other nodes would then have found the internal cluster nodes with portions of the cache, and begun to download from them.

It is important to note that even though the ~15GB cache is transferred to all 5 nodes, totalling 75GB of transferred data, only ~15GB is transferred from the Cache Origin.  The remaining ~60GB is transferred between nodes in the cluster.

Up Next

In Part 3 of the series, I will look at timings of the transfers using data analysis of trial runs.  As a hint, the BitTorrent protocol is slower than direct transfers for 1 to 1 transfers.  But it really shines when increasing the number of downloaders and seeders.