1. Stage input files to the worker node.
2. Start processing...
3. Stage output files back to the submit host.
In an ideal world, step 2 would take the longest. But as data sizes increase, so too does the time spent staging files in and out (primarily the stage-in). Additionally, we continue to recommend the same maximum job length to users, around 8 hours.
Let's use a real-world example. The nr BLAST database (Non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding those in env_nr) (source) is currently ~15 GB. When running BLAST, each query needs to run over the entire nr database, so the database is required on every worker node that will run a query.
It is easy to say: well, a 1 Gbps connection can transfer 15 GB in ~120 seconds. Two minutes doesn't seem unreasonable to stage in data, especially if the job can be 8 hours long. But usually you have many queries, so you will want to spread those queries across many jobs. Say you submit 10 jobs, each needing the 15 GB. If they all transfer at the same time over that single 1 Gbps link, it will take 20 minutes of stage-in time before any processing begins. And that is only 10 jobs; what if you are submitting 1,000 jobs, or 10,000? At 1,000 jobs, staging the input files alone takes about 33 hours. Suddenly 2 minutes to transfer a database becomes hours, and your submit machine is doing nothing but transferring input files. A little math (or a simple simulation, like the sketch below) also shows that if the jobs start sequentially, the stage-in bandwidth limits how many jobs can run simultaneously.
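To make that arithmetic concrete, here is a minimal back-of-the-envelope calculation. The 1 Gbps uplink, the 15 GB dataset, and the job counts are the assumptions from the example above, and the single shared link with no caching is a deliberate simplification:

```python
# Back-of-the-envelope stage-in time for N jobs sharing one submit-host link.
# Assumptions: a single 1 Gbps uplink, a 15 GB input dataset per job, no caching.

LINK_GBPS = 1.0               # submit host uplink, gigabits per second
DATASET_GB = 15.0             # nr database size, gigabytes
DATASET_GBITS = DATASET_GB * 8

def total_stage_in_hours(num_jobs: int) -> float:
    """Total time to push the dataset to every job over the shared link."""
    return num_jobs * DATASET_GBITS / LINK_GBPS / 3600

for n in (1, 10, 1_000, 10_000):
    print(f"{n:>6} jobs: {total_stage_in_hours(n):8.2f} hours of stage-in")

# 1 job  ~0.03 h (2 minutes), 10 jobs ~0.33 h (20 minutes),
# 1,000 jobs ~33 h, 10,000 jobs ~333 h.
```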
This increasing data size and static maximum job length have led to compromises and innovations on the part of users and sites.
Innovations
Some users and sites have attempted to solve the problem of large input files. The approaches typically fall into two categories:
- Bandwidth increases from the storage nodes.
- Caching near the execution host.
Lots O' Bandwidth
OSG Connect's Stash attempts to ease the input file staging problem by providing a storage service with lots of bandwidth (10 Gbps) and support for site caching through HTTP. The higher bandwidth solution is the brute-force method of decreasing file transfer times, but 10 Gbps only knocks a factor of 10 off all of the times above, and only if you can actually use all 10 Gbps. Numerous sites and users have tried to use high-bandwidth storage services to solve the stage-in problem. Nebraska (and many other sites) even have their storage services connected to 100 Gbps network links. But they tend to be limited not by the bandwidth available to the storage device, but by the bandwidth bottleneck at the boundary of each cluster to the outside world, usually a NAT.
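A trivial sketch of why the boundary matters; the specific bandwidth numbers here are illustrative assumptions, not measurements of any particular site:

```python
# Effective stage-in bandwidth is capped by the narrowest hop on the path:
# storage uplink -> wide-area network -> cluster boundary (often a NAT).
def effective_gbps(storage_uplink: float, cluster_boundary: float) -> float:
    return min(storage_uplink, cluster_boundary)

# A 100 Gbps storage service behind a 10 Gbps (or 1 Gbps) cluster NAT
# still delivers data to the worker nodes at only 10 (or 1) Gbps.
print(effective_gbps(storage_uplink=100.0, cluster_boundary=10.0))  # 10.0
print(effective_gbps(storage_uplink=100.0, cluster_boundary=1.0))   # 1.0
```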
Caching
With Stash's HTTP interface, users can take advantage of local site caching. When site caching was designed, it was meant for calibration data for detectors: rather small data that is frequently accessed, the perfect use of an HTTP cache. But what about when you want to transfer 15 GB of files through the HTTP cache? A typical site may only have a few cache servers, so download bandwidth is limited not by the available bandwidth of the hosting server, but by the available bandwidth of the cache servers. I don't know of many sites that are putting 10 Gbps connections on their caching servers.
Additionally, site caching simply runs from the submit host to the remote caching host. In the last week, 95% of the OSG VO's CPU hours have been provided by ~25 unique sites (source, calculations). This is analogous to increasing the bandwidth of the submit host by 25 times (assuming 1 Gbps connections are standard), which is a very cheap way to increase the bandwidth available to transfer input files. But the VO's usage is not split evenly among all 25 sites: 6 sites account for ~50% of the OSG VO usage. On those 6 clusters, the transfer bandwidth is limited to what those 6 proxy servers can deliver to their cluster's nodes. Therefore, for 50% of your processing slots, you are only increasing the transfer speed by a factor of 6.
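To put numbers on that, here is the earlier 1,000-job calculation with the site caches folded in. The site counts come from the paragraph above; one 1 Gbps cache server per site, and every job's data passing through a cache exactly once, are simplifying assumptions:

```python
# Tie the caching speedups back to the earlier 1,000-job example:
# 1,000 jobs x 15 GB over a single 1 Gbps submit-host link was ~33 hours.
# Assumptions: one 1 Gbps cache server per site, each job's dataset crosses
# a cache once, and no other contention.

BASELINE_HOURS = 1_000 * 15 * 8 / 1.0 / 3600   # ~33.3 h on one 1 Gbps link

scenarios = [
    ("spread across all ~25 site caches", 25),
    ("funneled through only the 6 busiest sites' caches", 6),
]
for note, caches in scenarios:
    print(f"{note}: ~{BASELINE_HOURS / caches:.1f} hours of stage-in")

# ~1.3 hours with 25 caches, but still ~5.6 hours when most of your slots
# sit behind just 6 of them.
```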
The HTTP protocol and HTTP caching are not designed for such large files. HTTP will always be designed for its dominant users: browsers downloading relatively small web pages. Software that uses HTTP is optimized for websites, that is, lots of small files. Therefore, any software used in conjunction with HTTP may not be ideal for large files.
A New Hope
Part of my PhD has been to develop a new way to handle large stage-in datasets. In the problem statement above, two areas were most often used to optimize transfer times: bandwidth and caching. I attempt to optimize both of these approaches using a daemon named the HTCondor CacheD.
Bandwidth
As noted above, HTTP is great for its designed use: websites with lots of small files. But it is not designed or well optimized for larger files. Further, the bandwidth from the execution host to the storage servers is typically bottlenecked at the cluster boundary. Therefore, another protocol that handles large files better was chosen: BitTorrent.
BitTorrent was chosen since it has many characteristics that make it ideal for transferring large files. In order to bypass the network bottlenecks at the remote clusters, BitTorrent allows clients to transfer data between themselves while also downloading from the original source. This allows every worker node to become a cache for the rest of the cluster's nodes. As we will see in the next post, BitTorrent works well because the vast majority of the traffic is between nodes inside the cluster, rather than to nodes outside the cluster.
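The CacheD's own transfer machinery is covered in the next post. Purely as an illustration of the peer behavior described here (not the CacheD's actual code), this is roughly what a worker-node peer looks like with the python-libtorrent bindings; the torrent file name and save path are placeholders:

```python
# Illustrative only: a worker-node peer downloads a dataset torrent and then
# keeps seeding it, so later peers in the same cluster can pull from it
# instead of from the origin.
import time
import libtorrent as lt

ses = lt.session()
info = lt.torrent_info("nr-database.torrent")              # placeholder name
handle = ses.add_torrent({"ti": info, "save_path": "/scratch/cache"})

while not handle.status().is_seeding:
    s = handle.status()
    print(f"{s.progress:6.1%} done, "
          f"down {s.download_rate / 1e6:6.1f} MB/s, "
          f"up {s.upload_rate / 1e6:6.1f} MB/s to/from other peers")
    time.sleep(5)

print("download complete; this node now seeds the dataset to its neighbors")
```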
Caching
The caching described above uses a single cache at each site. From the usage breakdown, you can see that VOs that rely on a few sites will find this a bottleneck. For campus users, who may only use 1 or 2 clusters, this can be nearly as bad a bottleneck as not using a cache at all. It also requires the site to have a caching server set up, which requires administrator cooperation.
The CacheD instead uses local caches on each worker node. This local cache allows very fast transfers when the jobs begin. Further, the local cache acts as a seeder for the BitTorrent transfers described above.
When a job begins running on a node, the stage-in step requests a copy of the data files from the local cache. If the data files are not already staged on that node, the local cache pulls them over BitTorrent from the submit host (the cache origin) and from all other nodes in the cluster. Once the local cache holds the stage-in data files, it transfers the cached files into the job's sandbox and processing begins. Subsequent jobs that require the same stage-in data files request and immediately receive the files, since they are already cached locally.
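In pseudocode, the stage-in path described above looks roughly like this. The cache directory, the stage_in function, and the fetch_via_bittorrent helper are hypothetical names for illustration, not the CacheD's real interface:

```python
# Illustrative sketch of the stage-in flow described above.
import shutil
from pathlib import Path

LOCAL_CACHE = Path("/var/cache/cached")      # assumed on-node cache directory

def stage_in(dataset: str, sandbox: Path) -> None:
    cached_copy = LOCAL_CACHE / dataset
    if not cached_copy.exists():
        # Cache miss: join the swarm and pull the dataset from the origin
        # (submit host) plus any other worker nodes that already hold it.
        fetch_via_bittorrent(dataset, origin="submit-host", dest=cached_copy)
    # Cache hit (or freshly completed fetch): copy from the local cache
    # into the job sandbox so the job can start immediately.
    shutil.copytree(cached_copy, sandbox / dataset)

def fetch_via_bittorrent(dataset: str, origin: str, dest: Path) -> None:
    """Hypothetical helper; stands in for the CacheD's BitTorrent transfer."""
    raise NotImplementedError
```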
Up Next
In the next post, I will describe the design of the CacheD in more detail. Further, I will show the CacheD in use with BLAST databases, and the improvement in job startup time that results from optimized stage-in transfers.
UPDATE: Link to Part 2, Architecture of the CacheD.