- OpenMPI needs to have an rsh binary. Even if you are running over shared memory and OpenMPI never invokes rsh, it still looks for the binary and fails if it cannot find it.
- Chroots (used on HCC machines for grid jobs) do not support ptys. OpenMPI has a compile-time option to turn off pty support.
Once these issues were fixed, we were able to submit QE jobs to the OSG using Condor's partitionable slots on 8 cores.
Before submitting our first QE job, we had to compile OpenMPI and QE. As an HPC center, we had OpenMPI built against our InfiniBand stack, so it would always fail on the OSG, where there is no InfiniBand (let alone our particular hardware and drivers).
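A rebuild along the following lines addresses both problems. This is a sketch, not our exact invocation: the install prefix is arbitrary, `--disable-pty-support` turns off pty support (the chroot issue above), and `--without-verbs` leaves InfiniBand support out of the build on recent OpenMPI releases (older releases used a different flag name).

```shell
# Relocatable, shared-memory/Ethernet-only OpenMPI build (sketch;
# prefix and flags are illustrative, check your OpenMPI version).
./configure --prefix=$HOME/openmpi \
            --disable-pty-support \
            --without-verbs
make -j8 && make install
```

Installing into its own prefix makes the whole tree easy to tar up and relocate on the worker node via `OPAL_PREFIX`, as the wrapper script below does.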
After compiling, we created compressed files that contained the required files to run QE:
- bin.tar.gz - Only includes the cp.x binary, specific to our run. It could just as well have included the more commonly used pw.x.
- lib.tar.gz - Includes the Intel math libraries and libgfortran.
- openmpi.tar.gz - Includes the entire OpenMPI install directory (the result of `make install`)
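The packaging step looks roughly like this. The copy commands are commented out because the source paths are placeholders for our build trees; substitute your own QE, math-library, and OpenMPI locations.

```shell
#!/bin/bash
# Sketch of building the tarballs the wrapper script expects.
set -e
mkdir -p bin lib openmpi

# Illustrative source paths -- replace with your own build trees:
# cp /opt/espresso/bin/cp.x bin/            # only cp.x for our run (add pw.x if needed)
# cp /opt/intel/mkl/lib/intel64/*.so lib/   # Intel math libraries
# cp /usr/lib64/libgfortran.so.* lib/       # libgfortran
# cp -r /opt/openmpi/* openmpi/             # the entire "make install" tree

tar czf bin.tar.gz bin
tar czf lib.tar.gz lib
tar czf openmpi.tar.gz openmpi
```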
Additionally, we wrote a wrapper script, run_espresso_grid.sh, that unpacks the required files and sets up the environment.
```shell
#!/bin/bash
tar xzf bin.tar.gz
tar xzf lib.tar.gz
tar xzf pseudo.tar.gz
tar xzf openmpi.tar.gz
mkdir tmp
export PATH=$PWD/bin:$PWD/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$PWD/lib:$PWD/openmpi/lib:$LD_LIBRARY_PATH
export OPAL_PREFIX=$PWD/openmpi
mpirun --mca orte_rsh_agent `pwd`/rsh -np 8 cp.x < h2o-64-grid.in > h2o-64-grid.out
```
We used GlideinWMS to submit to the OSG; below is our HTCondor submit file.
```
universe = vanilla
output = condor.out.$(CLUSTER).$(PROCESS)
error = condor.err.$(CLUSTER).$(PROCESS)
log = condor.log
executable = run_espresso_grid.sh
request_cpus = 8
request_memory = 10*1024
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = bin.tar.gz, lib.tar.gz, pseudo.tar.gz, openmpi.tar.gz, h2o-64-grid.in, /usr/bin/rsh
transfer_output_files = h2o-64-grid.out
+RequiresWholeMachine = True
Requirements = CAN_RUN_WHOLE_MACHINE =?= TRUE
queue
```
Note that we transfer rsh from the submit machine. OpenMPI does not actually use rsh to start the processes on a shared-memory machine, but it does require that the rsh binary be present.
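Since OpenMPI only checks that the binary exists, a do-nothing stub might work in place of shipping the real /usr/bin/rsh. This is an assumption we have not tested on the grid; the sketch below just generates such a stub.

```shell
#!/bin/bash
# Generate a no-op "rsh" stub. Assumption (untested by us): OpenMPI only
# needs an executable named rsh to exist, not a working remote shell,
# since it launches everything over shared memory anyway.
cat > rsh <<'EOF'
#!/bin/sh
exit 0
EOF
chmod +x rsh
```

If you use this, drop `/usr/bin/rsh` from `transfer_input_files` and ship the stub instead.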
This was done with the tremendous help of Jun Wang.
This work is licensed under a Creative Commons Attribution 3.0 Unported License.