For a while, we have heard about the need for GPUs on the Open Science Grid. Luckily, HCC hosts numerous GPU resources that we are willing to share. With today's release of HTCondor 8.2, we wanted to integrate these GPU resources transparently into the OSG.
Submission
Submission to the GPU resources at HCC uses the HTCondor-CE. The submission file is shown below (gist):
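As a rough illustration of such a submit file (the CE hostname below is a placeholder rather than HCC's actual gateway, and the file names are made up; see the gist for the real example):

    # Sketch of an HTCondor-CE submit file; hostname and file names are placeholders
    universe                = grid
    grid_resource           = condor ce.example.edu ce.example.edu:9619
    use_x509userproxy       = true
    executable              = gpu_test.sh
    output                  = gpu_test.out
    error                   = gpu_test.err
    log                     = gpu_test.log
    request_gpus            = 1
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue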
You may notice that it uses Request_GPUs to specify the number of GPUs the job requires. This is the same command used with native (non-grid) HTCondor. Each of our GPU nodes has exactly 3 GPUs, so you may submit with the line Request_GPUs = X (up to 3) to our HTCondor-CE node.
Additionally, the OSG Glidein Factories have a new entry point, CMS_T3_US_Omaha_tusker_gpu, which is available for VO Frontends to submit GPU jobs. Email the glidein factory operators to enable GPU resources for your VO.
Job Structure
The CUDA libraries are loaded automatically into the job's environment. Specifically, the worker nodes provide version 6.0 of the CUDA libraries.
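If you want to check what a job actually sees on the worker node, a short test script along the lines of the sketch below can be used as the job executable (this is our illustration, not one of the example files):

    #!/bin/bash
    # Print the GPU and CUDA environment visible to the job.
    nvidia-smi                               # GPU devices and driver version
    which nvcc && nvcc --version             # CUDA toolkit version, if nvcc is on PATH
    echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"  # should include the CUDA 6.0 libraries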
We tested submitting a binary compiled with CUDA 5.0 to Tusker. It required a wrapper script to configure the environment, and the CUDA library had to be transferred with the job. Details are in the gist along with the example files.
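As a rough sketch of the approach (assuming the CUDA 5.0 shared libraries are shipped with the job in a cuda-libs/ directory via transfer_input_files; the directory and binary names are illustrative), the wrapper boils down to:

    #!/bin/bash
    # Wrapper sketch: point the loader at the CUDA 5.0 libraries transferred
    # with the job, then run the actual binary.
    export LD_LIBRARY_PATH="$PWD/cuda-libs:$LD_LIBRARY_PATH"
    exec ./my_cuda50_binary "$@"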
Lessons Learned
Things to note:
- If a job matches more than one HTCondor-CE route, the job router will round-robin between those routes. It is therefore necessary to modify all of the routes if you want specific jobs to go to a specific route; see the route sketch after this list.
- Grid jobs do not source /etc/profile.d/ on the worker node. I had to manually source those files in the pbs_local_submit_attributes.sh file in order to use the module command and load the CUDA environment; a sketch of this also follows the list.
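For the first point, the fix is to add a Requirements expression to every route so that only jobs asking for GPUs match the GPU route. The sketch below shows the idea; the route names, queue name, and exact expressions are illustrative rather than our production configuration:

    JOB_ROUTER_ENTRIES @=jre
    [
      name = "Tusker_GPU";
      GridResource = "batch pbs";
      set_remote_queue = "gpu";
      Requirements = (TARGET.RequestGPUs =!= undefined && TARGET.RequestGPUs > 0);
    ]
    [
      name = "Tusker_CPU";
      GridResource = "batch pbs";
      Requirements = (TARGET.RequestGPUs =?= undefined);
    ]
    @jre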
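For the second point, the blahp (as we understand it) inserts whatever pbs_local_submit_attributes.sh prints into the PBS job script it generates, so shell commands echoed by that script run on the worker node before the payload starts. A sketch of the idea (the module name is illustrative):

    #!/bin/sh
    # Everything echoed here ends up in the generated PBS job script.
    echo 'for f in /etc/profile.d/*.sh; do [ -r "$f" ] && . "$f"; done'
    echo 'module load cuda/6.0'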
Comments

Derek,
This looks interesting. Could we access GPU resources from OSG Connect? We are interested in running Madgraph code on GPUs to simulate various exotic processes.
Rob
Hi Rob,
Yes, GPU resources are open to all VOs. I believe the OSG Connect service flocks to the OSG-Flock resource, so you would need to speak to them about enabling a GPU 'queue' (group) in the VO Frontend. The endpoint is in the GlideinWMS factory.