Derek Weitzel: 2011

Tuesday, October 25, 2011

September Progress Report

Since September was a while ago, I'll keep this short. Most of Sept. was figuring out the class/work/research schedule. It had been 2 years since I'd taken anywhere near a full load of classes, and it's funny how quickly you forget the work a class takes.

OSG Software
During September, I continued to help with the OSG software effort. By the beginning of September, I had handed most of my tasks off to others, but I still contributed here and there. I mostly contributed to discussions on the software that I had built, particularly the Condor build in the OSG repositories.

For the OSG Condor build, I backported the 7.7 builds from the Fedora distribution. There was some discussion on whether the method I chose to build Condor was the best. I would argue it is the best of a bad situation. There is no 'proper' build of Condor for RHEL5. This is very much on purpose, since Red Hat won't allow an EL5 build of Condor in EPEL because it's distributed in their MRG product. Also, the RPMs that the Condor team produces do not conform to the OSG software standards. They are statically linked against a lot of libraries. They don't have source RPMs. They also only recently started putting things in the right locations, /etc, /usr/bin...

But, by backporting the Fedora build, we didn't get CREAM support. We never had CREAM support before in the VDT, but we would still like it, since it's in the 'binary blob' Condor builds. We would need a properly packaged CREAM client in order to do this.

I'm also not a fan of removing Condor from the osg-client. Condor has been a member of the OSG client since the very beginning. Condor-G is the base method for submitting globus jobs for almost all systems we use today. I can imagine a user downloading the osg-client, then wanting to run a job. But they can't, because they need to install Condor too (which is easy, but still).

SuperComputing Conference Prep
In the month of September, I started developing and forming my idea for the Supercomputing visualization. If you don't know, I LOVE visualizations. Especially when they explain a very complex system in an understandable way. Self promotion: YouTube channel

The first task for SuperComputing was to add GLOW usage to the Google Earth display. While I was at it, I added documentation on how to install and run the OSG Google Earth display on the twiki.

There were no great ideas on how to expand the current Google Earth display. We could add transfers, but we don't have good transfer accounting on a per-VO level. Plus, not many VOs do file transfers outside of CMS and ATLAS. Another idea was to incorporate Globus Online traffic. But again, I don't think the number of transfers, especially to and from OSG sites, is high enough to show that traffic. Maybe one day it will...

So, I turned to something that I had wanted to learn for a while, Android. In the course of a weekend, I built an OSG Android App. After demonstrating the app at HCC for a week, I was able to purchase an Android tablet that will be interactively displayed at SuperComputing. The goal of the application is to provide interactive status of the OSG, either at the site level or by VO.

Campus Grids
We got Bill from VT running through the Engage submit host. He is flocking to the Engage submit host, and then out to the OSG, all while using whole machine slots. Additionally, he's flocking to campus factories that are also running whole machine jobs. He was very quick to get it set up, and then start running actual science through the system. His usage is currently monitored by Gratia. He hasn't been very active lately, so you may need to expand the start and end dates. We should thank the RENCI admins Brad and Jonathan for their help with the setup; they have been super fast learners of the GlideinWMS system. And, maybe more importantly, willing to experiment!

There were a few issues this month with Gratia collection. It mostly boils down to misunderstandings about what we are accounting, what is possible (practical) to account (not the same as what we are accounting), and what counts as an 'OSG job'.

We enabled Florida running with the campus factory. They had it set up before, but something had changed and caused some held jobs. It turned out to be harmless, but still a sign of how multiple layers can complicate the system. At least the users only see one layer, Condor.

Cloud Platform
When I got back after spending the summer at FNAL, there were a few changes at HCC. First, we had a 'cloud' platform that was just starting to take shape. HCC built an OpenStack prototype on a few nodes. I helped beta test the cloud, figuring out a few bugs. I'm happy to report that it is now 'just working'. It has been especially useful when I want to try out new software quickly. For example, last week I quickly installed the Ceph distributed file system, and found it usable. It also has been very useful for quick install tests of OSG software.

Misc.
Started using the GOC factory. Very happy to have some fault tolerance on our GlideinWMS install. Though in practice UCSD has been very stable, it's nice peace of mind.

New user on our GlideinWMS machine: Monica from Brown University, running some neutrino experiments (I'm not a physicist!). For now she's running as HCC, but that may change in the future if usage increases.

Various Grid administration of the HCC site. I've been transitioning off of administrating most of the HCC resources that I used to maintain, freeing me for class work mostly.

Friday, October 21, 2011

SC11 Visualization Prep

Last year I put together a visualization for SuperComputing '10 to show jobs moving over the OSG.

This visualization has since played at HCC on the tile wall. Here's a video from youtube.

I will be showing it again this year at SC11 in Seattle. I also will be showing my OSG android tablet application, which is currently on the App Store:

Instructions for installing and running this visualization are at:
https://twiki.grid.iu.edu/bin/view/Main/InstallingOSGGoogleEarth

Friday, October 14, 2011

CEPH on Fedora 15

Yesterday, I read a blog post about using Ceph as a backend store for virtual machine images. I've heard a lot about Ceph in the last year, especially after it was integrated into the mainline kernel in 2.6.34. So I thought I'd give it a try.

Before I get into the install, I want to summarize my thoughts on Ceph. I think it has a lot of potential, but parts of it are trying too hard to do everything for you. I always think there is a careful balance between a program doing too much for you, and making you do too much. For example, the mkcephfs script that creates a ceph filesystem will ssh to all the worker nodes (defined in ceph.conf) and configure the filesystem. If I was in operations, this would scare me.

Also, the keychain configuration is overly complicated. I think Ceph is designed to be secure over the WAN (secure, not encrypted), so maybe it's needed. But it seems overly complicated when you compare it to other distributed file systems (Hadoop, Lustre).

On the other hand, I really like the fully POSIX-compliant client, especially since it's in the mainline kernel. It is too bad that it was added in 2.6.34 rather than 2.6.32 (the RHEL 6 kernel). I guess we'll have to wait 2 years for RHEL 7 to have it in something we can use in production.

Also, the distributed metadata and multiple metadata servers are interesting aspects to the system. Though, in the version I tested, the MDS crashed a few times (the system picked it up and compensated).

On Fedora 15, ceph packages are in the repos.

yum install ceph

The configuration I settled on was:

[global]
    auth supported = cephx
    keyring = /etc/ceph/keyring.admin

[mds]
    keyring = /etc/ceph/keyring.$name
[mds.i-00000072]
    host = i-00000072
[mds.i-00000073]
    host = i-00000073
[mds.i-00000074]
    host = i-00000074

[osd]
    osd data = /srv/ceph/osd$id
    osd journal = /srv/ceph/osd$id/journal
    osd journal size = 512
    osd class dir = /usr/lib64/rados-classes
    keyring = /etc/ceph/keyring.$name
[osd0]
    host = i-00000072
[osd1]
    host = i-00000073
[osd2]
    host = i-00000074

[mon]
    mon data = /srv/ceph/mon$id
[mon0]
    host = i-00000072
    mon addr = 10.148.2.147:6789
[mon1]
    host = i-00000073
    mon addr = 10.148.2.148:6789
[mon2]
    host = i-00000074
    mon addr = 10.148.2.149:6789

As you can read from the configuration file, all files are stored in /srv/ceph/... You will need to make this directory on all your worker nodes.
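As a rough sketch of which directories that means in practice, the Python below reads a trimmed copy of the configuration above and lists the data directories each host needs. The tiny parser, the data_dirs helper, and the assumption that $id expands to the numeric suffix of the section name (osd0 gives 0) are mine for illustration; Ceph's own configuration handling is more involved.

```python
import re
from collections import defaultdict

# Trimmed copy of the ceph.conf above: only the data paths and hosts.
CONF = """\
[osd]
    osd data = /srv/ceph/osd$id
[osd0]
    host = i-00000072
[osd1]
    host = i-00000073
[osd2]
    host = i-00000074
[mon]
    mon data = /srv/ceph/mon$id
[mon0]
    host = i-00000072
[mon1]
    host = i-00000073
[mon2]
    host = i-00000074
"""


def parse_sections(text):
    """Parse 'key = value' lines under [section] headers into nested dicts."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[key.strip()] = value.strip()
    return sections


def data_dirs(text):
    """Map each host to the sorted osd/mon data directories it needs."""
    sections = parse_sections(text)
    dirs = defaultdict(list)
    for name, options in sections.items():
        match = re.fullmatch(r"(osd|mon)(\d+)", name)
        if not match or "host" not in options:
            continue
        kind, ident = match.groups()
        # The path template lives in the shared [osd] or [mon] section.
        template = sections.get(kind, {}).get(kind + " data")
        if template:
            # Assumption: $id is the numeric suffix of the section name.
            dirs[options["host"]].append(template.replace("$id", ident))
    return {host: sorted(paths) for host, paths in dirs.items()}


if __name__ == "__main__":
    for host, paths in sorted(data_dirs(CONF).items()):
        print(host, " ".join(paths))
```

Each printed line is a host followed by the directories to create there, which could be fed to mkdir -p over ssh.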

Next I needed to create a keyring for authentication with the client/admin/dataservers. The keyring tool is distributed with Ceph, and is called cauthtool. Even now, it's not clear to me how to use this tool, or how Ceph uses the keyring. First you need to make a caps (capabilities?) file:

osd = "allow *"
mds = "allow *"
mon = "allow *"

Here are the cauthtool commands to get it to work.

cauthtool --create-keyring /etc/ceph/keyring.bin
cauthtool -c -n i-00000072 --gen-key /etc/ceph/keyring.bin
cauthtool -n i-00000072 --caps caps /etc/ceph/keyring.bin
cauthtool -c -n i-00000073 --gen-key /etc/ceph/keyring.bin
cauthtool -n i-00000073 --caps caps /etc/ceph/keyring.bin
cauthtool -c -n i-00000074 --gen-key /etc/ceph/keyring.bin
cauthtool -n i-00000074 --caps caps /etc/ceph/keyring.bin
cauthtool --gen-key --name=admin /etc/ceph/keyring.admin
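Since the same pair of commands repeats for every host, one way to keep the host names in step is to generate the command lines. This is only a sketch: keyring_commands is a hypothetical helper, it reuses the cauthtool flags exactly as they appear above, and it prints the commands rather than running them.

```python
def keyring_commands(hosts, keyring="/etc/ceph/keyring.bin", caps_file="caps"):
    """Return the cauthtool command lines, as argument lists, for the hosts."""
    commands = [["cauthtool", "--create-keyring", keyring]]
    for host in hosts:
        # Generate a key for this host, then attach the caps file to it.
        commands.append(["cauthtool", "-c", "-n", host, "--gen-key", keyring])
        commands.append(["cauthtool", "-n", host, "--caps", caps_file, keyring])
    return commands


if __name__ == "__main__":
    for argv in keyring_commands(["i-00000072", "i-00000073", "i-00000074"]):
        print(" ".join(argv))
```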

From the blog post linked above, I used their script to create the directories and copy the ceph.conf to the other hosts.

n=0
for host in i-00000072 i-00000073 i-00000074 ; \
   do \
       ssh root@$host mkdir -p /etc/ceph /srv/ceph/mon$n; \
       n=$(expr $n + 1); \
       scp /etc/ceph/ceph.conf root@$host:/etc/ceph/ceph.conf
   done
mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.bin

Then copy the keyrings

for host in i-00000072 i-00000073 i-00000074 ; \
   do \
       scp /etc/ceph/keyring.admin root@$host:/etc/ceph/keyring.admin; \
   done

Then start up the daemons on all the nodes:

service ceph start

And to mount the system:

mount -t ceph 10.148.2.147:/ /mnt/ceph -o name=admin,secret=AQBlV5dO2TICABAA0/FP7m+ru6TJLZaPxFuQyg==

Where the secret is the output from the command:

 cauthtool --print-key /etc/ceph/keyring.bin

Tuesday, September 13, 2011

OSG... There's an app for that

So I thought this weekend would be a good time to learn something new. I put together an OSG Android App. Not all the features are working yet, but I tend to release early and often on all my projects.

I was motivated by SuperComputing '11 coming up. My other visualization is only incrementally better than last year, so I wanted to bring something new. Everyone knows the shinier the toys at SC, the better. Who knows, maybe I can use a tablet and show some cool things to people walking by.

So far, the working features are:
- Automatically updated global usage data on the home screen
- Usage queried from gratia for every site and VO
- Graphing of the usage data (though I want to work on that some)

The features I want to work on more:
- Get the map to show sites across the US. Whether it shows RSV data (boring) or usage data (slightly better), I'm not sure yet.
- Plot usage data as stacked line graph, consistent with current gratia plots.
- Get a better looking home screen. It seems too 'black'.

You can get it now on the Android market. Search for osg. It claims to support 2.2+. It works well on my 2.3.3 phone, and works in the 2.2 emulator. Please go and try it out.

Source is on github: https://github.com/djw8605/OSG-Android

DISCLAIMER: This app is unofficial.

Monday, August 15, 2011

Upgrading the osg-client to Testing

Another lively blog post about upgrading a set of packages to testing. Last time it was the osg-wn-client, this time it's the OSG-Client. I will use a jira task to track my time and progress on it.

Again, I looked at tickets by component for the osg-client. First, there are no bugs, only tasks. Second, the tasks are major or below (and the major one is a documentation task really).

So, I went to the osg-client koji page, copied the dependencies to a text document, and started pushing the dependencies to osg-testing. One by one. Really, a whole lot of:

koji tag-pkg el5-osg-testing owamp-3.2rc4-1
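The one-by-one tagging could be scripted from that text document. The sketch below is not what I ran; it just builds the same koji tag-pkg command for each build name, one per line. The helper names are made up for illustration, and nothing is executed, so the list can be reviewed before handing it to a shell or to subprocess.

```python
def builds_from_text(text):
    """Return build names, one per line, skipping blanks, comments, repeats."""
    builds = []
    for line in text.splitlines():
        name = line.strip()
        if name and not name.startswith("#") and name not in builds:
            builds.append(name)
    return builds


def tag_commands(builds, tag="el5-osg-testing"):
    """Build one `koji tag-pkg` argument list per build."""
    return [["koji", "tag-pkg", tag, build] for build in builds]


if __name__ == "__main__":
    # Stand-in for the text document of dependencies.
    document = """\
    owamp-3.2rc4-1
    bestman2-2.1.1-5
    """
    for argv in tag_commands(builds_from_text(document)):
        print(" ".join(argv))
```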

I found an extraneous package, the osg-voms-compat. It is no longer needed since we distribute LSC files. I removed it as a dependency of osg-client, and rebuilt the osg-client.

First iteration... only missed 2 dependencies of dependencies. web100_userland and I2Util (both from Internet 2). Second iteration... Success!

We now have osg-client in the testing repo. Took less than an hour.

Sunday, August 14, 2011

Upgrading OSG Software from Development to Testing

In the OSG-software team, there has been a lot of discussion about when the right time is to upgrade packages to testing from the developer playground. I'm not going to go into the discussion much, other than to say it should be tracked in Jira.

I'm going to go through my process (as I do it) of upgrading the osg-wn-client from development to testing.

Check Jira for tickets by component. For the osg-wn-client, I look at the osg-wn-client component:

Major: SOFTWARE-30: Add LSC files to the vomses. This bug has a first draft of the documentation done. It also is not a functional show stopper, rather it only affects the packagers.
Minor: SOFTWARE-78: Pegasus RPM cleanup / FHS. This is a feature request/task really. Though nice to have, it doesn't change the functionality of the package.

After the bugs have been reviewed (only 2), and especially since there are no 'blocker' bugs, we start the koji work. First, I look at the osg-wn-client metapackage. I will upgrade the dependencies before upgrading the metapackage. For each package in the dependencies (some are from EPEL, so I don't need to do anything), I will go to the package's koji page, looking for the name of the latest build (or latest tested build). The koji command is similar to:

koji tag-pkg el5-osg-testing bestman2-2.1.1-5

It's important to note that these are just the first level dependencies of the osg-wn-client. I will also need to add packages that are dependencies of the dependencies, such as the jdk (srm clients).
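Finding the dependencies of dependencies is a graph walk. Here is a minimal sketch, assuming the dependency information is already available as a dictionary; the DEPENDS_ON map below is made up for illustration and is not the real osg-wn-client dependency list. It returns every package reachable from the metapackage, ordered so that dependencies come before the packages that need them.

```python
def tagging_order(root, depends_on):
    """Return all packages reachable from root, dependencies first."""
    order = []
    seen = set()

    def visit(pkg):
        if pkg in seen:
            return
        seen.add(pkg)
        for dep in depends_on.get(pkg, []):
            visit(dep)
        # A package is appended only after everything it depends on.
        order.append(pkg)

    visit(root)
    return order


# Hypothetical dependency map, for illustration only.
DEPENDS_ON = {
    "osg-wn-client": ["bestman2", "gfal"],
    "bestman2": ["jdk"],
}

if __name__ == "__main__":
    for pkg in tagging_order("osg-wn-client", DEPENDS_ON):
        print(pkg)
```

The metapackage always comes out last, which matches upgrading the dependencies before the metapackage.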

Once you have tagged all the builds that are dependencies of osg-wn-client, you can upgrade the meta-package. Then it's time to pull out the VM. I need to test the install, see if I caught all the packages. Since we're upgrading so many packages all at once (likely a one time thing), we need to test the install along the way.

Luckily, I have mash running on a local server, so I can just execute mash to pull the latest from the testing repo. If you want my mash configs, they're on github: https://github.com/djw8605/osg-mash. I needed to edit the /etc/yum.repos.d/osg-testing.repo file in the VM in order to point to my local mash repo.

I did have missing dependencies: gfal, libglite_data_delegation_api_simple_c.so, libglite_data_util.so.0, libglite-sd-c.so.2, vdt-ca-certs. Another iteration (after yum clean all)... Missing libglite_data_delegation_api_simple_c. Another iteration... success!

To find what package provides what library, I just use the web interface, clicking on the RPM's info, to find what the package Provides. I'm sure there is a clever yum command to do most of this for me, but this didn't take me much time anyways.

So, we now have an osg-wn-client in the osg-testing repo. Stay tuned next week for the upgrading of the osg-client. Just need to resolve the blocker ticket for the osg-client.

Wednesday, August 10, 2011

OSG Site Admin meeting: Day 2

Another beautiful (read: HOT) day in Lubbock.

Last night we ate at the National Ranching Heritage Center as part of a social get-together. The food was great! I had a great time talking with other CMS collaborators as well. I talked with some T3 admins from UC Riverside and Baylor, learning about what it is like to run a T3 (and talking some Big Bang Theory).

I also talked with Horst from OU to find out that they are no longer running the CoLinux cluster on their desktop machines. Rather, they are using VMWare to run Linux inside Windows on their desktops. Something like 1600 cores.

The heritage center had a number of historical buildings from around the area.

(Credit: Horst Severini)

The talks have been good. Here's a quick picture from Doug's Globus Online talk.

Tuesday, August 9, 2011

OSG Site Admin meeting: Day 1

Hello from the hot, hot, hot Lubbock.

Had a good morning of broad plenary talks. The indico schedule gives links to the talks.

In the afternoon, I gave my 2 talks: GlideinWMS Frontend install and Campus Factory install. We had one person install and get the campus factory up and working (though, only one was trying to install). We had to add some custom pbs commands for their cluster in the pbs_local_submit_attributes.sh file.

Meanwhile, Jose was installing the glideinwms-vofrontend from RPMs. He had a conflict with an igtf rpm of certificates. Was solved by removing the igtf rpm, and allowing OSG to own the certificates.

Here's a picture from Marco's talk. I'll upload more pictures later.

And from Greg's Talk:

And from my Hands-On:

Wednesday, August 3, 2011

Testing Release of OSG-Client RPM

It's finally here! The first official testing release of the osg-client in RPM form. We're asking anyone and everyone to go out there and start testing. The official instructions are on the OSG Twiki. We even made a Testing Client page if you want to run through some simple commands.

The instructions are simple, install EPEL, then install the VDT repo:

rpm -Uvh http://vdt.cs.wisc.edu/repos/3.0/el5/development/x86_64//vdt-release-3.0-2.noarch.rpm

Install the osg-client (expect a lot of dependencies). My latest build downloaded 140MB of dependencies.

yum install --enablerepo=vdt-testing --nogpgcheck osg-client

Then, go and do your normal workflows. Create proxies, run globus-job-run's, start condor and run jobs. Whatever, just test. If you have any issues, email osg-software@opensciencegrid.org. Or join me on the OSG jabber, osg@conference.indiana.edu.

All current bugs (I'm happy with how few there are) with the osg-client can be found on Jira.

Go forth and test...

Friday, June 17, 2011

Job Visualization

A year or so ago, Ian Stokes-Rees showed me a job visualization that he had put together to see how his workflows were going. It gives a good overview per site. Recently, we were investigating usage from a NEES user, and I adapted the aging script to visualize the workflow. I think it turned out really well.

Note: Be prepared to zoom in a lot. Job IDs are printed at the beginning of execution. Green lines are successful completions. Red lines are Evicted or Job Disconnected jobs.

View on Google Docs

Very ugly source is on github: https://github.com/djw8605/condor_log_analyze

It's times like this I wish I had a tile wall again.

Note, I found chrome's pdf viewer faster than Preview.

Friday, June 10, 2011

HCC Walltime Efficiency

Rob's post on walltime efficiency was interesting. I did the same with HCC and found that we had much better efficiency than I expected. Especially with GlideinWMS's default behavior of sticking around for 20 minutes waiting for a job before exiting.

What's interesting is that we rarely run jobs at RENCI-Engagement, OSCER, or GPGRID. It's difficult to interpret this. Since we run so many workflows, it's not clear to me that this means much.

Need more cores, Scotty!

A researcher/grid user from UNL has come back! Monitoring

100K+ Jobs.

Even though we have a lot of jobs idle, we are only able to get maybe 4.5k running jobs. Doing some investigating, it looks like HCC is competing with GLOW and Engage for slots.

Special thanks to Purdue!

Barely a light load on the glidein machine. This many jobs would not have worked on the old glidein machine, which was a Xen VM.

Of course with Condor, load and memory usage are more a function of # running jobs rather than jobs in queue.

Also, the new GlideinWMS gratia monitoring is picking up the usage:

Wednesday, June 8, 2011

Gratia DB Rates

In my recent discussions with NCSA on gratia, I found something that is probably obvious to other people, but I would still like to point out.

Status captured 6-8-11

According to the OSG PR Display, the OSG runs and accounts for 420,000 jobs daily. That's 4.8 records a second added into the MySQL JobUsageRecord table. Also, for each job, summary data is updated in the summary table. So that's at least 5 inserts/updates a second, but most likely many more. All the while, we are able to interactively query the database (well, a replicated one) using pages such as the UNL graphs.
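The arithmetic behind that rate is simple enough to check. The sketch below divides the daily job count by the seconds in a day, which gives about 4.86 records a second (the 4.8 above, truncated). The doubled figure assumes exactly one summary-table update per job record, which is my simplification and not something measured.

```python
JOBS_PER_DAY = 420_000          # figure quoted from the OSG PR Display
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(daily_count):
    """Convert a per-day count into an average per-second rate."""
    return daily_count / SECONDS_PER_DAY


if __name__ == "__main__":
    inserts = per_second(JOBS_PER_DAY)
    # Assumption: exactly one summary-table update accompanies each record.
    writes = 2 * inserts
    print(f"JobUsageRecord inserts per second: {inserts:.2f}")
    print(f"inserts plus summary updates per second: {writes:.2f}")
```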

We also keep track of transfers, but they are summarized and aggregated at the probe before reaching the database.

So, in summary, kudos to the gratia operations team.

(NOTE: I may have forgotten some optimizations that we use. But still, that's a lot of records)

Monday, June 6, 2011

Using VOMS Groups and Roles

I’ve been working on gratia graphs with Ashu Guru for the new GlideinWMS probe. I’ve stumbled upon a very odd command line usage of voms-proxy-init.

In order to get a role to work with voms-proxy-init, the command needs to be:

voms-proxy-init -voms hcc:/hcc/Role=collab

This command will give the proxy the role collab in the generic group hcc.

VOMS also has the ability to use ‘groups’ in the vo to distinguish between users. In order to use a group, the syntax is:

voms-proxy-init -voms hcc:/hcc/testgroup/Role=collab

This command will give the proxy the role collab in the group testgroup of the HCC VO.
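To make the pattern in those two commands explicit, here is a small sketch that assembles the value passed to -voms from a VO name, a role, and an optional group. The voms_argument helper is hypothetical; it only reproduces the two forms shown above and makes no claim about other syntax that voms-proxy-init may accept.

```python
def voms_argument(vo, role, group=None):
    """Build the `-voms` value: <vo>:/<vo>[/<group>]/Role=<role>."""
    path = "/" + vo
    if group:
        # Groups nest under the VO's root group.
        path += "/" + group.strip("/")
    path += "/Role=" + role
    return f"{vo}:{path}"


if __name__ == "__main__":
    print("voms-proxy-init -voms", voms_argument("hcc", "collab"))
    print("voms-proxy-init -voms", voms_argument("hcc", "collab", group="testgroup"))
```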

Not necessarily the most complicated command line I have used, but certainly one of the worst documented.

voms-proxy-init -help

doesn’t help at all.

Thursday, April 28, 2011

Overflowing the Tier 2

The last few days, the Nebraska CMS Tier 2 has been full. And I mean really full.

The expected wait time for some of these users (55 active users) is over a month. There are a few questions that arise from this situation:

Why are so many people submitting here?
How can we help users get science done faster?

After talking to people, the first question is a combination of "We have the data they want to run against", and "We have been more reliable in the past". Both are compliments.

The second question is the fun one. To answer this question, we decided to flock jobs to Purdue through our GlideinWMS frontend.

Here's a diagram of our idea.

The Red gatekeepers will advertise idle jobs to the HCC GlideinWMS Frontend.
The frontend will ask for glideins at the Purdue CMS clusters, submitting with a CMS certificate.
When the glideins report back, the Frontend will negotiate with Red's gatekeepers to send jobs to the glideins at Purdue.
Once at Purdue, the CMS analysis jobs will use Purdue's CMS software install, and XRootd to stream data from UNL.

This behavior will require a lot of configuration.

GlideinWMS Configuration

The GlideinWMS Frontend needed to add another group that would query the red gatekeepers. We named this group T2Overflow.

The glideins that start as a part of the T2Overflow group will only run jobs from the T2. We accomplished this with a new start expression in the glideins:
<attr name="GLIDECLIENT_Group_Start" glidein_publish="False" job_publish="False" parameter="True" type="string" value="(TARGET.IsT2Overflow =?= TRUE)"></attr>

Next, we want the glideins to be submitted as the CMS user, since they are CMS jobs. For this, we used Brian's certificate. We changed how we specify the proxy used to submit, from globally at the top of the frontend.xml file, to per group. If it is defined in both places, the frontend/factory will round-robin between the proxies.

Next, we want to submit to the Purdue CMS cluster. This required changing our entry point match expression to:
<factory query_expr="(GLIDEIN_Site == &quot;Purdue&quot;)&amp;&amp;((stringListMember(&quot;CMS&quot;, GLIDEIN_Supported_VOs)))"></factory>

Gatekeeper Configuration

The jobs that will flock to Purdue are limited by the glideins' START expression listed above. Now, we want the jobs to only flock to the T2Overflow group, not other groups running on the HCC Frontend. In order to do this, we needed to change the requirements of the jobs inside the condor.pm to include:
( IS_GLIDEIN =!= true || TARGET.GLIDECLIENT_Group =?= "T2Overflow" )

This tells Condor: If we're matching to a glidein (IS_GLIDEIN = true), then only match to the T2Overflow group.

The jobs need the IsT2Overflow = true in order for the frontend to pick up the idle jobs. The IsT2Overflow attribute is also used in the START expression on the glideins to only start overflow jobs. This attribute was added to each job in the condor.pm using the condor syntax:
+IsT2Overflow = false

This way, jobs are automatically disabled from overflowing, but can be easily changed by doing a condor_qedit.
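The interplay of the two expressions is easier to see as a truth table. The sketch below is a simplified Python model, not how Condor evaluates ClassAds: undefined attributes are represented by None, =?= and =!= become identity checks, and slots outside the T2Overflow group are assumed to have no extra START restriction.

```python
def is_identical(a, b):
    """Model of ClassAd =?=: same type and same value, never undefined."""
    return type(a) is type(b) and a == b


def job_accepts_slot(slot):
    """( IS_GLIDEIN =!= true || TARGET.GLIDECLIENT_Group =?= "T2Overflow" )"""
    return (not is_identical(slot.get("IS_GLIDEIN"), True)
            or is_identical(slot.get("GLIDECLIENT_Group"), "T2Overflow"))


def glidein_accepts_job(job):
    """(TARGET.IsT2Overflow =?= TRUE), the T2Overflow group START expression."""
    return is_identical(job.get("IsT2Overflow"), True)


def matches(job, slot):
    """Both sides must agree; only T2Overflow glideins apply the START check."""
    if slot.get("GLIDECLIENT_Group") == "T2Overflow":
        start_ok = glidein_accepts_job(job)
    else:
        start_ok = True  # assumption: no extra START restriction elsewhere
    return job_accepts_slot(slot) and start_ok


if __name__ == "__main__":
    local_slot = {}  # a regular T2 slot, IS_GLIDEIN undefined
    overflow = {"IS_GLIDEIN": True, "GLIDECLIENT_Group": "T2Overflow"}
    other = {"IS_GLIDEIN": True, "GLIDECLIENT_Group": "main"}  # made-up group
    default_job = {"IsT2Overflow": False}   # as submitted by condor.pm
    enabled_job = {"IsT2Overflow": True}    # after condor_qedit
    for job_name, job in [("default", default_job), ("enabled", enabled_job)]:
        for slot_name, slot in [("local", local_slot),
                                ("overflow", overflow),
                                ("other glidein", other)]:
            print(job_name, "job on", slot_name, "slot:", matches(job, slot))
```

In this model a default job matches only local slots, an enabled job matches local and overflow slots, and neither matches glideins from other groups.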

Also for the jobs, we need to configure the environment. When a globus job is submitted to Condor, the Environment classad is filled with site specific attributes. When we flock to Purdue, we want to erase those attributes and instead use the ones given to us from the glidein environment.

The environment changing proved difficult. Our goal is to use an expression such as:
Env = $$([ ifThenElse(WantsT2Overflow=?=true, "", "MyEnv=yes") ])

In our testing, the environment ended up empty in both the True and False cases.

UPDATE: 4/29 9:30AM
We figured out the error with the environment, and opened a Condor ticket. In the meantime, we figured out a workaround:
Environment = "$$([ ifThenElse(IsT2Overflow =?= TRUE, \"IsT2Overflow=1\", \"...\") ])"
Where the ... is the original environment. This will cause the environment to stay the same if the job is executed on the Nebraska T2, but to wipe out the environment if executed on the Overflow nodes.

We have the Overflow working for ~500 jobs now.

Saturday, April 9, 2011

Day 7 at CERN - Saturday

For breakfast, fruit and a croissant.

Brian and I walked to the bus stop at CERN, tried deciphering the bus schedule. We found the correct bus (after walking to all 3 bus stops at CERN), and rode it to the Tram. The tram ride was ~20 minutes, and took us directly to downtown Geneva.

First we found the restaurant, Cafe de Paris, where we were going to meet my cousin. It was very near the tram stop. It was now 9:45, and we were planning a noon lunch with my cousin. We therefore walked around the pedestrian square around the restaurant. I was on a mission for a Swiss watch.

We walked down to the river and saw the fountain:

On the other side of the river, we walked along the shops. We saw a very oddly named shop, Globus. Those of you in the grid community will understand the humor:

The Globus shop was just a regular K-Mart like store (but much, much more expensive).

The shopping on the south side of the lake was much more expensive than in the pedestrian square. It was also more focused on high fashion, something I'm not too interested in. Stores were dedicated to Gucci, Rolex, ...

We walked up the hill to the old city. There we found a cathedral:

This cathedral was founded in the Roman era (first century AD). There was a museum under the cathedral cataloging the history of the cathedral.

There were several layers in the museum, and it stretched the length of the very large cathedral. It also had the skeleton of a very early leader of the cathedral. You can notice the round hole near the head of the skeleton; it was dug up some time after it was buried to be displayed.

The cathedral museum even had the original floor of the Bishop's residence. The drawing on the floor (even a little of the color) was still there, even though the floor was very warped:

After leaving the Cathedral, we got a few pictures of the magnitude of the structure we were underneath:

Those are people way up in the tower.

We walked back to the pedestrian square and the Cafe De Paris. There we had lunch with my cousin.

After lunch, I shopped a little more before purchasing a Swiss watch from a local store.

At this time, it was a little after 2, and we decided to head back to CERN for a nap.

We spent the rest of the evening sitting on the patio, eating dinner and enjoying a beautiful view of the Alps on our last day.

(I should have taken my hat off)

Friday, April 8, 2011

Day 6 at CERN - Friday

Breakfast was orange juice, fruit, and a croissant.

Morning Session
This morning was the site related meetings. We had a talk from Oliver from FNAL regarding data operations. It concluded that we should separate tape from disk servers, effectively creating a Tier-2 at the Tier-1's, and using the tape as a separate storage.

The next talk was on glexec. It concluded that glexec is not properly installed at most sites. Actually, only 1 Tier-2 has it properly installed. With figures like that, one has to think it's not the sites' fault; there has been a miscommunication. Brian pointed out that there have been multiple mixed messages regarding whether glexec was important or not.

The other talks this morning were on a monitoring technology, HappyFace, and on minimum site size. Both of these talks had little discussion.

For lunch, I met with Dirk and a colleague of his here at CERN. I had spaghetti with tomato sauce.

Afternoon
Immediately after lunch, Brian and I met with Aaron D. for a tour of the 0.5 site. The 0.5 site is the location of the CMS detector (0.5 is half way around the ring).

The Control room has many monitors with information on them.

Next, we went to the assembly hall. This is a very tall building where the detector was built (in pieces) and sent down the shaft to the experiment hall.

In this picture, you can see the large scale of the cranes and the building, which is roughly 5 stories high. It's currently used as a storage area. The blue scaffolding in the pictures is half the height of the detector. The scaffolding was used to access the beam pipe (made partly of beryllium) while inserting the pixel detector (designed and built at Nebraska).

After the tour, Brian and I sat around with Chris talking about the future of CMSSW (theme for the rest of the day).

We joined the USCMS meeting at CERN. There we heard about the outlook for funding in the US: unknown.

For dinner, we went to an Indian restaurant that is frequently visited by CMS. We ordered 4 different appetizers and 8 different entrees (there were 6 of us). Each of us tried all of the food.

When we got back from dinner, we sat outside restaurant 1, discussing the budget and the future of CMSSW. Especially how we are going to deal with the ROOT IO layer.

Thursday, April 7, 2011

Day 5 at CERN - Thursday

Breakfast was Choco Puffs, Fruit, and Orange/Passion fruit juice.

Today was largely Management meetings, so I was free to play tourist at CERN. I went to the Reception area:

There I bought stuff for the family. I sure hope Tristan's shirt will fit.

I then walked past the Dome. I learned later that it opens at 10, I was an hour early.

Next, I walked by the ATLAS control room. It is positioned above the ATLAS detector, which is ~100 m underground.

I wish I could visit the CMS Control room, but it's located on the other side of the ring, at 0.5.

For lunch, a few people from FNAL drove downtown to a Kabob restaurant, though no one actually had Kabobs. Instead, we had very large pita sandwiches. I had a beef pita with tomatoes, onions, and lettuce.

In the afternoon, I worked on a few OSG things (mostly debugging campus factory). I met with CernVM-FS people to discuss possibly making it into a generic library that can be used by other applications (non-fuse).

For dinner, Chris, Eric, and I walked to a restaurant towards Geneva.

I had Penne with meat sauce. Very good!

On the trip back, I took a few pictures (didn't turn out all that well). This is a picture of the road to CERN from our restaurant. You can see (fuzzy) the CERN Dome.

We also saw the new Tram for CERN.

Wednesday, April 6, 2011

Day 4 at CERN - Wednesday

Breakfast was Milk, Orange juice, Fruit, Vanilla (Natural) yogurt, and a croissant.

Morning Session
This morning was the Monitoring Task Force. This stuff is very interesting. I'm noticing a lot of CouchDB in the talks. I've read a little about CouchDB; it looks just like a distributed key/value store.

There were talks on Xrootd and NFS4. They compared the two, and both have benefits and drawbacks: Xrootd has better vector reads, while NFS4 is better at local access (maybe). Brian gave his talk on Xrootd. It covered a lot about the demonstrator and how it has progressed, as well as the current schedule for improvements. There was a lot of discussion on monitoring, and how MonALISA was a good base but wasn't providing the monitoring we need for full deployment.

Lunch was pasta with tomato sauce.

Afternoon Session
LHCONE talk: We are relying more on the network, so we need better than best-effort network QoS.

A few more talks before we got to IO. Brian gave a good talk on the ideas he has on IO. It mostly dealt with compression and basket size for internal serialization.

There was a talk from another developer about how CMS IO was not designed for the world we're living in. There was a lot of discussion about this talk, not so much about the content of the talk, but about how CMS should look in the future.

Next was a talk from a FNAL grad student who replaced ROOT IO with Google Protocol Buffers. It was interesting, albeit probably not practical for all of CMSSW. But it did bring up a good discussion: it's not difficult to get many times faster IO by bypassing ROOT IO. How do we approach the deficiencies of ROOT IO? Do we talk with ROOT? Do we make something and give it to ROOT? Certainly we don't want to 'own' an IO layer.

For dinner, we went to a restaurant on the north side of the Geneva Valley. It had a great view of the Alps.

The restaurant was a regular French restaurant. I had a salad with ham, egg, and cheese. The main course was chicken, potatoes, and green beans. The dessert was a banana split with chocolate, strawberry, and vanilla ice creams. The receipt isn't very descriptive; I hope UNL will take it.

The restaurant was also a hotel.

Tuesday, April 5, 2011

Day 3 at CERN - Tuesday

This morning was mostly spent fixing my user registration and getting my CERN account. Breakfast was a slice of quiche, fruit, and an orange juice.

Morning Session
I attended a morning session on the new data access system, DAS (combines PhEDEx, DBS, ...). It provides a web user interface and can produce JSON as well. There was a lot of talk about capabilities available in DBS but not in DAS.

For lunch, we walked over to Restaurant 1. I had rice and shrimp. It was very nice outside, weather says 61F, so we sat outside. The conversation centered around changes that could be made to ROOT.

Afternoon Session
The afternoon talks focused on CMS's grid usage. There were talks on PhEDEx, but there was also a lot of talk about GlideinWMS. Most/all of the CMS Production is moving to GlideinWMS. I'm working with Igor and Burt to get the frontend and factory installed. People seem to understand GlideinWMS, but not necessarily how it all hooks into CMS Production in the US (through the job router, for some reason).

There were talks on PhEDEx and CRAB 3. CRAB 3 was deprecating the local submit mode. This drew some attention from others in the room. The local submit will likely be reconsidered, especially since it should be so similar to the GlideinWMS method.

Before dinner, I sat down with a few folks from FNAL (and a few from CERN) to talk over changes in ROOT with Brian. Brian had just finished his few changes to compression that decreased the file size of RECO by ~25%. But, in bigger news, I found a Budweiser that was good. But it was from the Czech Republic.

For dinner, Brian and I ate at Restaurant 1 across from our hostel. We had a very good discussion about the data flow of the CMS workflow. I learned where most of the data goes.

Tonight I got some work done and finished up this blog post.

Monday, April 4, 2011

Day 2 At CERN

Woke up this morning at 7:30. Grabbed some breakfast across the street from the Hostel. I had orange juice with passion fruit (whatever that is), mixed fruit, and a chocolate French roll.

After this, Brian and I walked to the Users' Center to create my account. Then, we walked to the ID badge place to get my CERN ID. At this time, it was raining.

After getting my ID, we walked to the meeting building, which happened to be on the other side of CERN (I walked across the Swiss/French border; you can't even tell).

Morning Session
Listened to 3 talks. The first two were very CMS and physics heavy. I didn't understand much. Ian gave a talk about the future challenges in CMS computing. Punchlines were:

Redesigning the CAF (CERN Analysis Farm) to have better data access (integrate Xrootd redirector)
We're running out of cycles at CMS T1s
CMS reduced the AOD (Analysis-Oriented Data) copies. More transfers between sites rather than strict adherence to the tree, trickle-down theory.

Lunch was at Restaurant 2. Had pasta with white sauce and small chunks of ham on top, salad, and a Coke. The Coke, like most drinks here, was not all that cool. I guess Americans are weird with their need for ice and very cold drinks. Pasta was good, but big.

Afternoon Session
A few physics talks that I didn't quite grasp. Geant4 talk. Fast simulation talk.

Learned a lot about the CMS Production workflow.

Questions:

Seems odd that they have to warn sites what files they will run on, shouldn't we be running better technology? This sounds like an issue with dCache? Do they do this on Hadoop Clusters? Lustre?
Didn't mention if these production jobs run outside the T1s. I certainly see something from CMS Production at Nebraska.

T1 Processing. Basically we're going to run out of computing after the end of the year. Especially if the data collecting is like last year, where 75% of the data was taken after September. This makes me wonder why CMS doesn't invest in what Miron has said before: Make it run everywhere and my toaster.

After the talk, we went to the 'CMS Drink' at a building right next to our Hostel (though it was a walk from where the meetings were held). This was just an hors d'oeuvre and wine event. I forgot how much I didn't like red wine. But the white wine was tolerable.

During the 'Drink' session, a few of the people that Brian knows planned dinner at a local chicken place, Chez Ma Cousine.

And a few people outside of the entrance after eating.