Wednesday, June 25, 2014

GPUs on the OSG


For a while, we have heard of the need for GPUs on the Open Science Grid.  Luckily, HCC hosts numerous GPU resources that we are willing to share, and with today's release of HTCondor 8.2, we wanted to integrate these GPU resources transparently into the OSG.

Submission

Submission to the GPU resources at HCC goes through the HTCondor-CE.  The submission file we used is available as a gist.
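In outline, it is a normal grid-universe submit file.  A minimal sketch is below; the CE hostname, executable, and file names are placeholders for illustration rather than our actual values:

    # Sketch of a grid-universe submit file targeting an HTCondor-CE.
    # The CE hostname and executable are placeholders.
    universe      = grid
    grid_resource = condor ce.example.edu ce.example.edu:9619
    use_x509userproxy = true

    executable    = my_gpu_app
    output        = gpu_job.out
    error         = gpu_job.err
    log           = gpu_job.log

    # Ask for one GPU; the same knob works with native (non-grid) HTCondor.
    Request_GPUs  = 1

    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT

    queue

You submit it with condor_submit and monitor it with condor_q, just like a local job.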

You may notice that it uses Request_GPUs to specify the number of GPUs the job requires.  This is the same command used when running with native (non-grid) HTCondor.  You may submit with the line Request_GPUs = X (up to 3) to our HTCondor-CE node, since each GPU node has exactly 3 GPUs.

Additionally, the OSG Glidein Factories have a new entry point, CMS_T3_US_Omaha_tusker_gpu, which is available for VO Frontends to submit GPU jobs.  Email the glidein factory operators to enable GPU resources for your VO.

Job Structure

The CUDA libraries are loaded automatically into the job's environment.  Specifically, we are running CUDA libraries version 6.0.
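If you want to double-check what a job actually sees, a couple of quick commands inside the job will show the CUDA setup (assuming the module has put the toolkit on the PATH):

    # Run inside a job to confirm which CUDA toolkit/runtime is visible.
    nvcc --version                                        # toolkit version, if nvcc is on the PATH
    echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i cuda  # library directories the job will search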

We tested submitting a binary compiled against CUDA 5.0 to Tusker.  It required a wrapper script to configure the environment and to transfer the CUDA library with the job.  Details and the example files are in the gist.
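The wrapper itself is short.  Roughly, it looks like the sketch below; the binary and library names are placeholders, and libcudart.so.5.0 is assumed to be listed in transfer_input_files so it lands in the job's scratch directory:

    #!/bin/bash
    # Sketch of a wrapper for a CUDA 5.0 binary on a node that ships CUDA 6.0.
    # libcudart.so.5.0 is transferred with the job (transfer_input_files) and
    # ends up in the job's scratch directory ($PWD).

    # Put the transferred 5.0 runtime ahead of the system libraries.
    export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH

    # Run the real CUDA binary, passing through any arguments.
    exec ./my_cuda50_binary "$@"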

Lessons Learned

Things to note:
  • If a job matches more than one HTCondor-CE route, the router will round-robin between the matching routes.  Therefore, it is necessary to modify all routes if you want specific jobs to go to a specific route.
  • Grid jobs do not source /etc/profile.d/ on the worker node.  I had to manually source those files in the pbs_local_submit_attributes.sh file in order to use the module command and load the CUDA environment (see the sketch after this list).
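For reference, the relevant part of that file amounts to something like the sketch below.  The script typically lives at /etc/blahp/pbs_local_submit_attributes.sh on the CE, and whatever it prints is inserted into the PBS job script that the blahp generates, so the echoed lines run on the worker node (the module name here is illustrative):

    #!/bin/sh
    # Sketch of pbs_local_submit_attributes.sh: everything echoed here becomes
    # part of the generated PBS job script and executes on the worker node.

    # Grid jobs skip the login environment, so pull in /etc/profile.d by hand
    # to make the module command available.
    echo 'for f in /etc/profile.d/*.sh; do source "$f"; done'

    # Then load the CUDA environment.
    echo 'module load cuda/6.0'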

Resources

  • HTCondor-CE Job Router config for routing GPU jobs appropriately (a sketch of such a config follows below).
  • HTCondor-CE PBS local submit attributes file that includes the source and module commands.
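To illustrate the routing change (route names and attribute choices here are a sketch, not our exact production config), the idea is to give every route a Requirements expression so that a GPU job can only match the GPU route and everything else can only match the CPU route:

    JOB_ROUTER_ENTRIES @=jre
    [
      name = "Tusker_GPU";
      GridResource = "batch pbs";
      TargetUniverse = 9;
      Requirements = (TARGET.RequestGPUs isnt undefined && TARGET.RequestGPUs > 0);
    ]
    [
      name = "Tusker_CPU";
      GridResource = "batch pbs";
      TargetUniverse = 9;
      Requirements = (TARGET.RequestGPUs is undefined || TARGET.RequestGPUs == 0);
    ]
    @jre

Because the two Requirements expressions are mutually exclusive, a job only ever matches one route, so the round-robin behavior described above never kicks in.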