Friday, June 17, 2011

Job Visualization

A year or so ago, Ian Stokes-Rees showed me a job visualization he had put together to see how his workflows were going.  It gives a good overview per site.  Recently, we were investigating usage from a NEES user, and I adapted the aging script to visualize the workflow.  I think it turned out really well.

Note:  Be prepared to zoom in a lot.  Job IDs are printed at the beginning of execution.  Green lines are successful completions.  Red lines are evicted or disconnected jobs.

View on Google Docs

The (very ugly) source is on GitHub: https://github.com/djw8605/condor_log_analyze
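For the curious, the gist of the script is just walking the Condor user log and classifying events by their numeric event codes.  Roughly along these lines (not the actual script, and the event codes are quoted from memory):

grep -c '^001 ' jobs.log
grep -c '^005 ' jobs.log
grep -c -e '^004 ' -e '^022 ' jobs.log

The first counts execute events (001), the second terminate events (005, the green lines), and the third evicted (004) and disconnected (022) events (the red lines).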

It's times like this I wish I had a tile wall again.


Note: I found Chrome's PDF viewer faster than Preview.

Friday, June 10, 2011

HCC Walltime Efficiency

Rob's post on walltime efficiency was interesting.  I did the same with HCC and found that we had much better efficiency than I expected, especially given GlideinWMS's default behavior of having glideins stick around for 20 minutes waiting for a job before exiting.
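If memory serves, that 20 minutes comes from the glidein's GLIDEIN_Max_Idle setting (seconds of idle time before the glidein shuts itself down), which the factory sets to something like:

GLIDEIN_Max_Idle = 1200

The name and default here are from memory, so treat this as a sketch rather than gospel.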

What's interesting is that we rarely run jobs at RENCI-Engagement, OSCER, or GPGRID.  It's difficult to interpret this: since we run so many workflows, it's not clear to me that it means much.

Need more cores, Scotty!

A researcher/grid user from UNL has come back!  Monitoring 100K+ jobs.

Even though we have a lot of jobs idle, we are only able to get maybe 4.5k running jobs.  Doing some investigating, it looks like HCC is competing with GLOW and Engage for slots.
Special thanks to Purdue!

Barely any load on the glidein machine.  This many jobs would not have worked on the old glidein machine, which was a Xen VM.

Of course, with Condor, load and memory usage are more a function of the number of running jobs than of jobs in the queue.

Also, the new GlideinWMS gratia monitoring is picking up the usage:

Wednesday, June 8, 2011

Gratia DB Rates

In my recent discussions with NCSA about gratia, I found something that is probably obvious to other people, but that I would still like to point out.

Status captured 6-8-11
According to the OSG PR Display, the OSG runs and accounts for 420,000 jobs daily.  That's 4.8 records a second added to the MySQL JobUsageRecord table.  Also, for each job, summary data is updated in the summary table.  So that's at least 5 inserts/updates a second, and most likely many more.  All the while, we are able to interactively query the database (well, a replicated one) using pages such as the UNL graphs.
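The back-of-the-envelope math, for anyone who wants to check it:

echo "scale=1; 420000 / 86400" | bc
4.8

That is 420,000 records spread over the 86,400 seconds in a day, before counting the summary-table updates.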

We also keep track of transfers, but they are summarized and aggregated at the probe before reaching the database.

So, in summary, kudos to the gratia operations team.

(NOTE: I may have forgotten some optimizations that we use.  But still, that's a lot of records)

Monday, June 6, 2011

Using VOMS Groups and Roles

I've been working on gratia graphs with Ashu Guru for the new GlideinWMS probe.  I've stumbled upon a very odd command-line usage of voms-proxy-init.


In order to get a role to work with voms-proxy-init, the command needs to be:

voms-proxy-init -voms hcc:/hcc/Role=collab
This command will give the proxy the role collab in the generic group hcc.


VOMS also has the ability to use 'groups' in the VO to distinguish between users. In order to use a group, the syntax is:

voms-proxy-init -voms hcc:/hcc/testgroup/Role=collab
This command will give the proxy the role collab in the group testgroup of the HCC VO.
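To check that the group and role actually made it into the proxy, voms-proxy-info should (if I am remembering the option right) list the attributes:

voms-proxy-info -fqan

For the command above, it should print something like /hcc/testgroup/Role=collab/Capability=NULL.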


Not necessarily the most complicated command line I have used, but certainly one of the worst documented.

voms-proxy-init -help
doesn’t help at all.