Thursday, April 28, 2011

Overflowing the Tier 2

The last few days, the Nebraska CMS Tier 2 has been full. And I mean really full.

The expected wait time for some of these users (55 active users) is over a month. There are a few questions that arise from this situation:
  1. Why are so many people submitting here?
  2. How can we help users get science done faster?
After talking to people, the answer to the first question is a combination of "We have the data they want to run against" and "We have been more reliable in the past". Both are compliments.

The second question is the fun one. To answer this question, we decided to flock jobs to Purdue through our GlideinWMS frontend.

Here's a diagram of our idea.

  1. The Red gatekeepers will advertise idle jobs to the HCC GlideinWMS Frontend.
  2. The frontend will ask for glideins at the Purdue CMS clusters, submitting with a CMS certificate.
  3. When the glideins report back, the Frontend will negotiate with Red's gatekeepers to send jobs to the glideins at Purdue.
  4. Once at Purdue, the CMS analysis jobs will use Purdue's CMS software install, and XRootd to stream data from UNL.
This behavior will require a lot of configuration.

GlideinWMS Configuration

The GlideinWMS Frontend needed another group that would query the Red gatekeepers. We named this group T2Overflow.

The glideins that start as a part of the T2Overflow group will only run jobs from the T2. We accomplished this with a new start expression in the glideins:
<attr name="GLIDECLIENT_Group_Start" glidein_publish="False" job_publish="False" parameter="True" type="string" value="(TARGET.IsT2Overflow =?= TRUE)"></attr>
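To make the =?= in that start expression concrete, here is a small Python model of how we understand it to behave. It is only a sketch: UNDEFINED, meta_equals and group_start are names made up for illustration and are not part of Condor or GlideinWMS. The point is that =?= always yields true or false, so a job that never set IsT2Overflow is simply rejected by the glidein.

```python
# Hypothetical model of ClassAd "=?=" semantics; this is not Condor code.
UNDEFINED = object()  # stands in for an attribute that is missing from an ad


def meta_equals(a, b):
    """Model of ClassAd `=?=`: the result is always True or False."""
    if a is UNDEFINED or b is UNDEFINED:
        # Only UNDEFINED is identical to UNDEFINED.
        return a is b
    # Both the type and the value have to agree.
    return type(a) is type(b) and a == b


def group_start(job_ad):
    """Model of the glidein clause (TARGET.IsT2Overflow =?= TRUE)."""
    return meta_equals(job_ad.get("IsT2Overflow", UNDEFINED), True)


if __name__ == "__main__":
    for ad in ({"IsT2Overflow": True}, {"IsT2Overflow": False}, {}):
        print(ad, "->", group_start(ad))
```

Only the first ad, the one that explicitly sets IsT2Overflow to true, is allowed to start on the glidein in this model.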

Next, we want the glideins to be submitted as the CMS user, since they are CMS jobs. For this, we used Brian's certificate. We changed how we specify the proxy used to submit, from globally at the top of the frontend.xml file to per group. If it is defined in both places, the frontend/factory will round-robin between the proxies.

Next, we want to submit to the Purdue CMS cluster. This required changing our entry point match expression to:
<factory query_expr="(GLIDEIN_Site == &quot;Purdue&quot;)&amp;&amp;((stringListMember(&quot;CMS&quot;, GLIDEIN_Supported_VOs)))"></factory>

Gatekeeper Configuration

The jobs that will flock to Purdue are limited by the glideins' START expression listed above. Now, we want the jobs to flock only to the T2Overflow group, not to other groups running on the HCC Frontend. To do this, we needed to change the requirements of the jobs to include:
( IS_GLIDEIN =!= true || TARGET.GLIDECLIENT_Group =?= "T2Overflow" )

This tells Condor: If we're matching to a glidein (IS_GLIDEIN = true), then only match to the T2Overflow group.
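The same requirement can be written out as a tiny Python truth table. This is a sketch of the logic only, under the assumption that a regular (non-glidein) worker node simply does not advertise IS_GLIDEIN; the helper names and the "SomeOtherGroup" value are invented for the example.

```python
# Hypothetical model of the job requirement; this is not Condor code.
UNDEFINED = object()  # stands in for an attribute that is missing from an ad


def meta_equals(a, b):
    """Model of ClassAd `=?=`: always True or False, never undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def meta_not_equals(a, b):
    """Model of ClassAd `=!=`, the negation of `=?=`."""
    return not meta_equals(a, b)


def job_requirement(machine_ad):
    """( IS_GLIDEIN =!= true || TARGET.GLIDECLIENT_Group =?= "T2Overflow" )"""
    is_glidein = machine_ad.get("IS_GLIDEIN", UNDEFINED)
    group = machine_ad.get("GLIDECLIENT_Group", UNDEFINED)
    return meta_not_equals(is_glidein, True) or meta_equals(group, "T2Overflow")


if __name__ == "__main__":
    machines = {
        "local worker": {},
        "T2Overflow glidein": {"IS_GLIDEIN": True, "GLIDECLIENT_Group": "T2Overflow"},
        "other glidein": {"IS_GLIDEIN": True, "GLIDECLIENT_Group": "SomeOtherGroup"},
    }
    for name, ad in machines.items():
        print(name, "->", job_requirement(ad))
```

In this model the job still matches local worker nodes and T2Overflow glideins, and it refuses glideins from any other group.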

The jobs need IsT2Overflow = true in order for the frontend to pick up the idle jobs. The IsT2Overflow attribute is also used in the START expression on the glideins to only start overflow jobs. This attribute was added to each job using the Condor syntax:
+IsT2Overflow = false

This way, jobs do not overflow by default, but overflow can easily be enabled for a job with condor_qedit.

Also for the jobs, we need to configure the environment. When a Globus job is submitted to Condor, the Environment ClassAd attribute is filled with site-specific attributes. When we flock to Purdue, we want to erase those attributes and instead use the ones given to us from the glidein environment.

Changing the environment proved difficult. Our goal was to use an expression such as:
Env = $$([ ifThenElse(WantsT2Overflow=?=true, "", "MyEnv=yes") ])

In our testing, the environment ended up empty in both the True and False cases.

UPDATE: 4/29 9:30AM
We figured out the error with the environment, and opened a Condor ticket. In the meantime, we figured out a workaround:
Environment = "$$([ ifThenElse(IsT2Overflow =?= TRUE, \"IsT2Overflow=1\", \"...\") ])"
Where the ... is the original environment. This will cause the environment to stay the same if the job is executed on the Nebraska T2, but to wipe out the environment if it is executed on the Overflow nodes.
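The choice made by that ifThenElse can be modeled in a few lines of Python. This is only a sketch of the selection logic, not of how Condor expands $$([ ]) at match time; the function names and the sample environment string are made up for illustration.

```python
# Hypothetical model of the ifThenElse choice; this is not Condor code.
UNDEFINED = object()  # stands in for an attribute that is missing from an ad


def meta_equals(a, b):
    """Model of ClassAd `=?=`: always True or False, never undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def expand_environment(is_t2_overflow, original_env):
    """ifThenElse(IsT2Overflow =?= TRUE, "IsT2Overflow=1", original_env)"""
    if meta_equals(is_t2_overflow, True):
        # Overflow case: the site-specific environment is dropped.
        return "IsT2Overflow=1"
    # Otherwise the original environment is passed through untouched.
    return original_env


if __name__ == "__main__":
    sample_env = "SITE=Nebraska"  # made-up stand-in for the "..." above
    for flag in (True, False, UNDEFINED):
        print(expand_environment(flag, sample_env))
```

Because the test uses =?=, a missing IsT2Overflow falls through to the original environment instead of producing an undefined result.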

We have the Overflow working for ~500 jobs now.

Saturday, April 9, 2011

Day 7 at CERN - Saturday

For breakfast, fruit and a croissant.

Brian and I walked to the bus stop at CERN and tried to decipher the bus schedule. We found the correct bus (after walking to all 3 bus stops at CERN), and rode it to the tram. The tram ride was ~20 minutes, and took us directly to downtown Geneva.

First we found the restaurant, Cafe de Paris, where we were going to meet my cousin. It was very near the tram stop. It was now 9:45, and we were planning a noon lunch with my cousin. We therefore walked around the pedestrian square around the restaurant. I was on a mission for a Swiss watch.

We walked down to the river and saw the fountain:

On the other side of the river, we walked along the shops. We saw a very oddly named shop, Globus. Those of you in the grid community will understand the humor:

The Globus shop was just a regular K-Mart like store (but much, much more expensive).

The shopping on the south side of the lake was much more expensive than in the pedestrian square. It was also more focused on high fashion, something I'm not too interested in. Stores were dedicated to Gucci, Rolex, ...

We walked up the hill to the old city. There we found a cathedral:

This cathedral was founded in the Roman era (first century AD). There was a museum under the cathedral cataloging the history of the cathedral.

There were several layers in the museum, and it stretched under the very large cathedral. It also had the skeleton of a very early leader of the cathedral. You can notice the round hole near the head of the skeleton; it was dug up some time after it was buried to be displayed.

The cathedral museum even had the original floor of the Bishop's residence. The drawing on the floor (even a little of the color) was still there, even though the floor was very warped:

After leaving the cathedral, we got a few pictures of the magnitude of the structure we were underneath:

Those are people way up in the tower.

We walked back to the pedestrian square and the Cafe de Paris. There we had lunch with my cousin.

After lunch, I shopped a little more before purchasing a Swiss watch from a local store.

At this time, it was a little after 2, and we decided to head back to CERN for a nap.

We spent the rest of the evening sitting on the patio, eating dinner and enjoying a beautiful view of the Alps on our last day.

(I should have taken my hat off)

Friday, April 8, 2011

Day 6 at CERN - Friday

Breakfast was orange juice, fruit, and a croissant.

Morning Session
This morning was the site related meetings. We had a talk from Oliver from FNAL regarding data operations. It concluded that we should separate tape from disk servers, effectively creating a Tier-2 at the Tier-1's, and using the tape as a separate storage.

The next talk was on glexec. It concluded that glexec is not properly installed at most sites. Actually, only 1 Tier-2 has it properly installed. With figures like that, one has to think it's not the sites' fault; there has been a miscommunication. Brian pointed out that there have been multiple mixed messages regarding whether glexec was important or not.

The other talks this morning were on a monitoring technology, HappyFace, and on minimum site size. Both of these talks had little discussion.

For lunch, I met with Dirk and a colleague of his here at CERN. I had spaghetti with tomato sauce.

Immediately after lunch, Brian and I met with Aaron D. for a tour of the 0.5 site. The 0.5 site is the location of the CMS detector (0.5 is half way around the ring).

The Control room has many monitors with information on them.

Next, we went to the assembly hall. This is a very tall building where the detector was built (in pieces) and sent down the shaft to the experiment hall.

In this picture, you can see the large scale of the cranes and the building, which is roughly 5 stories high. It's currently used as a storage area. The blue scaffolding in the pictures is half the height of the detector. The scaffolding was used to access the beam pipe (made partly of beryllium) while inserting the pixel detector (designed and built at Nebraska).

After the tour, Brian and I sat around with Chris talking about the future of CMSSW (theme for the rest of the day).

We joined the USCMS meeting at CERN. There we heard about the outlook for funding in the US: unknown.

For dinner, we went to an Indian restaurant that is frequently visited by CMS. We ordered 4 different appetizers and 8 different entrees (there were 6 of us). Each of us tried all of the food.

When we got back from dinner, we sat outside restaurant 1, discussing the budget and the future of CMSSW, especially how we are going to deal with the ROOT IO layer.

Thursday, April 7, 2011

Day 5 at CERN - Thursday

Breakfast was Choco Puffs, Fruit, and Orange/Passion fruit juice.

Today was largely Management meetings, so I was free to play tourist at CERN. I went to the Reception area:

There I bought stuff for the family. I sure hope Tristan's shirt will fit.

I then walked past the Dome. I learned later that it opens at 10; I was an hour early.

Next, I walked by the ATLAS control room. It is positioned above the ATLAS detector, which is ~100 m underground.

I wish I could visit the CMS Control room, but it's located on the other side of the ring, at 0.5.

For lunch, a few people from FNAL drove downtown to a Kabob restaurant, though no one actually had Kabobs. Instead, we had very large pita sandwiches. I had a beef pita with tomatoes, onions, and lettuce.

In the afternoon, I worked on a few OSG things (mostly debugging campus factory). I met with CernVM-FS people to discuss possibly making it into a generic library that can be used by other applications (non-fuse).

For dinner, Chris, Eric, and I walked to a restaurant towards Geneva.

I had Penne with meat sauce. Very good!

On the trip back, I took a few pictures (didn't turn out all that well). This is a picture of the road to CERN from our restaurant. You can see (fuzzy) the CERN Dome.

We also saw the new Tram for CERN.

Wednesday, April 6, 2011

Day 4 at CERN - Wednesday

Breakfast was Milk, Orange juice, Fruit, Vanilla (Natural) yogurt, and croissant.

Morning Session
This morning is the Monitoring Task Force. This stuff is very interesting. I'm noticing a lot of CouchDB in the talks. I've read a little about CouchDB; it looks just like a distributed key/value store.

Talks on Xrootd and NFS4. They compared Xrootd and NFS4, both have bonuses and drawbacks. Xrootd has better vector reads, NFS4 is better at local access (maybe). Brian gave his talk on Xrootd. A lot about the demonstrator, and how it has progressed. Also the current schedule for improvements. There was a lot of discussion on monitoring, and how MonaLisa was a good base, but wasn't providing the monitoring we need for full deployment.

Lunch was pasta with tomato sauce.

Afternoon Session
LHC One talk: We are relying more on the network, so need better than best-effort network QOS.

A few more talks before we got to IO. Brian gave a good talk on the ideas he has on IO. It mostly dealt with compression and basket size for internal serialization.

There was a talk from another developer about how CMS IO was not designed for the world we're living in. There was a lot of discussion about this talk, not so much about the content of the talk, but about how CMS should look in the future.

Next was a talk from a FNAL grad student who replaced ROOT IO with Google Protocol Buffers. It was interesting, albeit probably not practical for all of CMSSW. But it did bring up a good discussion: it's not difficult to get many times faster IO by bypassing ROOT IO. How do we approach the deficiencies of ROOT IO? Do we talk with ROOT? Do we make something and give it to ROOT? Certainly we don't want to 'own' an IO layer.

Dinner we went to a restaurant on the north side of the Geneva Valley. It had a great view of the Alps.

The restaurant was a regular French restaurant. I had a salad with ham, egg, and cheese. The main course was chicken, potatoes, and green beans. The dessert was a banana split with chocolate, strawberry, and vanilla ice creams. The receipt isn't very descriptive; I hope UNL will take it.

The restaurant was also a hotel.

Tuesday, April 5, 2011

Day 3 at CERN - Tuesday

This morning was mostly spent fixing my user registration and getting my CERN account. Breakfast was a slice of quiche, fruit, and an orange juice.

Morning Session
I attended a morning session on the new Data access system (combines phedex, dbs, ...). It provides a web user interface and can produce JSON as well. There was a lot of talk about capabilities available on DBS but not on DAS.

For lunch, we walked over to Restaurant 1. I had rice and shrimp. It was very nice outside, weather says 61F, so we sat outside. The conversation centered around changes that could be done to ROOT.

Afternoon Session
The afternoon talks have been focused on CMS's grid usage. Talks were on Phedex. But there was also a lot of talk on GlideinWMS as well. Most/All of the CMS Production is moving to GlideinWMS. Working with Igor and Burt to get the frontend and factory installed. People seem to understand GlideinWMS, but not necessarily how it all hooks into CMS Production in the US (through job router, for some reason).

There were talks on Phedex and Crab 3. Crab 3 was deprecating the local submit mode. This brought some attention from others in the room. The local submit will likely be reconsidered, especially since it should be so similar to the GlideinWMS method.

Before dinner, I sat down with a few folks from FNAL (and a few from CERN) to talk over changes in ROOT with Brian. Brian had just finished his few changes to compression that decreased file size of RECO by ~25%. But, bigger news, found a Budweiser that was good. But it was from the Czech Republic.

For dinner, Brian and I ate at restaurant 1 across from our hostel. We had a very good discussion about the data flow of the CMS workflow. I learned most of where the data all goes.

Tonight I got some work done, finished up this blog post.

Monday, April 4, 2011

Day 2 At CERN

Woke up this morning at 7:30. Grabbed some breakfast across the street from the Hostel. I had orange juice with passion fruit (whatever that is), mixed fruit, and a chocolate french roll.

After this, Brian and I walked to the Users' Center to create my account. Then, we walked to the ID badge place to get my CERN ID. At this time, it was raining.

After getting my ID, we walked to the meeting building, which happened to be on the other side of CERN (I walked across the Swiss/French border, can't even tell).

Morning Session
Listened to 3 talks. The first two were very CMS and physics heavy. I didn't understand much. Ian gave a talk about the future challenges in CMS computing. Punchlines were:
  • Redesigning the CAF (CERN Analysis Farm) to have better data access (integrate Xrootd redirector)
  • We're running out of Cycles at CMS T1's
  • CMS reduced the AOD (Analysis-Oriented Data) copies. More transfers between sites rather than strict adherence to tree, trickle down, theory.
Lunch was at Restaurant 2. Had pasta with white sauce and small chunks of ham on top, salad, and a Coke. The Coke, like most drinks here, was not all that cool. I guess Americans are weird with their need for ice and very cold drinks. Pasta was good, but big.

Afternoon Session
A few physics talks that I didn't quite grasp. Geant4 talk. Fast simulation talk.

Learned a lot about the CMS Production workflow.

  • Seems odd that they have to warn sites what files they will run on, shouldn't we be running better technology? This sounds like an issue with dCache? Do they do this on Hadoop Clusters? Lustre?
  • Didn't mention if these production jobs run outside the T1's. I certainly see something from CMS Production at Nebraska.
T1 Processing. Basically we're going to run out of computing after the end of the year. Especially if the data collecting is like last year, where 75% data was taken after September. This makes me wonder why CMS doesn't invest in what Miron has said before: Make it run everywhere and my toaster.

After the talk, we went to the 'CMS Drink' at a building right next to our Hostel (though it was a walk from where the meetings were held). This was just an hors d'oeuvres and wine event. I forgot how much I didn't like red wine. But the white wine was tolerable.

During the 'Drink' session, a few of the people that Brian knows planned dinner at a local chicken place, Chez Ma Cousine.

And a few people outside of the entrance after eating.

Sunday, April 3, 2011

Arriving at CERN

Had a decent flight into CERN. Aaron D. picked Brian and me up at the airport. The Hostel opens at 9:00, so we ate some breakfast in the cafeteria before we checked into our hostel.

After checking in, I took a quick shower and started to fight the jet lag.

We met Liz in the Hostel parking lot. She gave us a ride to Ian's house where we had brunch. Brunch consisted of grilled chicken, potatoes, and various other French food (cheese, bread, ...). The view from Ian's house of the Swiss Alps:

We stayed at Ian's until just past 3:00. After coming back, we met with a few other people from Fermilab at the patio across the street from the Hostel. View of Swiss Alps from patio:

After talking for a while, I couldn't stand it and went back to the hostel room and took a quick nap.

After the nap, I met Brian and a few other people from FNAL and Ken Bloom (UNL Physicist) to walk to dinner. We walked towards Geneva to a pizza place named Pizza D Oro. I had a normal'ish pizza. It was a thin crust beef pizza. But, the beef was just sliced beef like if you cut a steak horizontally.

After dinner, Brian and I just crashed.