Friday, February 17, 2012

Ceph on Fedora 16

I've written before about how to run ceph on Fedora 15, but now I'm working on Fedora 16.

Last time I complained about how much ceph tries to do for you.  For better or worse, it now attempts to do even more!

For my setup, I had 3 nodes in the HCC private cloud.  First, we need to install ceph.
$ yum install ceph

Then, create a configuration file for ceph.  The RPM comes with a good example that my configuration is based on.  The sample configuration is in /usr/share/doc/ceph/sample.ceph.conf

My configuration: Derek's Configuration

The configuration has authentication turned off.  I found this useful because ceph-authtool (yes, they renamed it since Fedora 15) is difficult to use, and because all of the nodes are on a private vlan only reachable with my openvpn key :)
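Since the linked configuration may not survive, here's a rough sketch of what a no-auth configuration looks like, modeled on my Fedora 15 configuration later on this page.  The hostname i-000000c1 is a stand-in for my first node (only i-000000c2 and i-000000c3 appear below), so treat the whole thing as illustrative, not a copy of my real file:

```ini
[global]
    ; no "auth supported = cephx" and no keyrings -- authentication is off,
    ; acceptable only because the nodes sit on a private vlan
    auth supported = none

[mds.i-000000c1]
    host = i-000000c1

[osd]
    osd data = /data/osd.$id
[osd0]
    host = i-000000c1
[osd1]
    host = i-000000c2
[osd2]
    host = i-000000c3

[mon]
    mon data = /data/mon.$id
[mon0]
    host = i-000000c1
    mon addr = 10.148.2.147:6789
```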

Then, you need to create and distribute ssh keys to all of your nodes so that mkcephfs can ssh to them and configure them.
$ ssh-keygen 

Then copy them to the nodes:
$ ssh-copy-id i-000000c2
$ ssh-copy-id i-000000c3

Be sure to make the data directories on all the nodes.  In this case:
$ mkdir -p /data/osd.0
$ ssh i-000000c2 'mkdir -p /data/osd.1'
$ ssh i-000000c3 'mkdir -p /data/osd.2'

Then run the mkcephfs command:
$ mkcephfs -a -c /etc/ceph/ceph.conf

And start up the daemons:
$ service ceph start

The daemons should be running now.  If they fail for some reason, they tend to report what the problem was.  Also, the logs for the services are in /var/log/ceph

To mount the filesystem, find an ip address of one of the monitors.  In my case, I had a monitor on ip address 10.148.2.147.  The command to mount is:
$ mkdir -p /mnt/ceph
$ mount -t ceph 10.148.2.147:/ /mnt/ceph

Since you don't have any authentication, it should work without problems.

I've had some problems with the different mds daemons, and even had an OSD die on me.  It resolved itself, and I added another OSD to take its place, recreating the CRUSH table.  Since creating this, I have even worked with the graphical interface.

And here's a presentation I did about the Ceph paper.  Note, I may not be entirely accurate in the presentation, so be kind.

Tuesday, February 7, 2012

Fedora 16 on OpenStack

After following Brian's guide on installing Fedora 15 on OpenStack, I thought I would try my hand at Fedora 16.  There were a few differences.

Filesystem Differences
Brian's guide installed Fedora using LVM.  I installed Fedora without LVM (there's a little checkbox on the partition page of Anaconda).  Without LVM, I can skip the steps on listing the physical volumes and logical volumes to find the start and end of the partition.

Also, Fedora 16 uses a GPT partition table.  The fdisk command cannot read it, so I had to install gdisk (in epel).  Running it gives very similar commands and output:

$ /usr/sbin/gdisk -l /tmp/fedora16
GPT fdisk (gdisk) version 0.8.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /tmp/fedora16: 20971520 sectors, 10.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): A351197B-8233-4811-9B28-69A1DE121AD2
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 20971486
Partitions will be aligned on 2048-sector boundaries
Total free space is 4029 sectors (2.0 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048            4095   1024.0 KiB  EF02  
   2            4096         1028095   500.0 MiB   EF00  ext4
   3         1028096        16777215   7.5 GiB     0700  
   4        16777216        20969471   2.0 GiB     8200  


Then, to extract the image (gdisk's end sector is inclusive, hence the +1):
dd if=/tmp/fedora16 of=/tmp/server-extract.img skip=1028096 count=$((16777215-1028096+1)) bs=512
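The skip/count arithmetic is easy to get subtly wrong.  The gdisk listing gives inclusive start and end sectors for partition 3, so a quick sanity check of the numbers (not part of the extraction itself) looks like:

```shell
# Sector bounds for partition 3, read off the gdisk output above.
START=1028096
END=16777215

# gdisk's "End (sector)" column is inclusive, hence the +1.
COUNT=$((END - START + 1))
echo "$COUNT sectors"
echo "$((COUNT * 512 / 1024 / 1024)) MiB"   # matches the ~7.5 GiB gdisk reports
```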

SSH Key Differences
Brian's guide instructed you to create a /etc/rc.local.  Fedora 16 sees the introduction of systemd, which no longer executes /etc/rc.local.  Instead, it looks for the file /etc/rc.d/rc.local (possibly a symlink to /etc/rc.local?).  This file needs to be executable, and be sure to include the shebang.
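For reference, a minimal sketch of what such an /etc/rc.d/rc.local file can look like, pulling the instance's SSH public key from the metadata service.  The metadata URL is the EC2-style convention that OpenStack emulates, and the paths are illustrative; adjust for your setup:

```shell
#!/bin/sh
# /etc/rc.d/rc.local -- must be executable (chmod +x) and keep the shebang
# above, or systemd will not run it.

# Fetch the instance's SSH public key from the EC2-style metadata service.
mkdir -p /root/.ssh
chmod 700 /root/.ssh
curl -sf http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key \
    >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys
```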

Also, Fedora 16's selinux doesn't label the root file system correctly (BUG), and simply making the .ssh directory does not allow sshd to read it.  To solve the selinux problem, I disabled selinux (bad, bad me).


Common Commands
After installing Fedora 16 into an image and extracting the kernel and ramdisk, there were a few commands that were executed over and over as I debugged the image:

Make the changes to the image:
sudo /usr/libexec/qemu-kvm -m 2048 -drive file=/tmp/fedora16 -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize

Extract the partition:
dd if=/tmp/fedora16 of=/tmp/server-extract.img skip=1028096 count=$((16777215-1028096+1)) bs=512

Start the VM to change the label on the image:
sudo /usr/libexec/qemu-kvm -m 2048 -drive file=/tmp/fedora16 -net nic -net user -vnc 127.0.0.1:0 -cpu qemu64 -M rhel5.6.0 -smp 2 -daemonize -drive file=/tmp/server-extract.img 

Rename the image to something appropriate:
mv /tmp/server-extract.img /tmp/fedora16-extracted.img

Bundle the image for OpenStack:
euca-bundle-image --kernel aki-0000002e --ramdisk ari-0000002f -i /tmp/fedora16-extracted.img -r x86_64

Upload the image to OpenStack:
euca-upload-bundle -b derek-bucket -m /tmp/fedora16-extracted.img.manifest.xml

Register the image (this command completes quickly, but openstack takes forever to decrypt and untar the image):
euca-register derek-bucket/fedora16-extracted.img.manifest.xml


Now to build OSG packages for Fedora...  maybe not.


Tuesday, January 24, 2012

Testing a Globus-Free OSG Software Stack (From EPEL(-testing))

As you may or may not know, there is a massive globus update pending in EPEL that will update globus to the version the OSG distributes.  What this means is much less work for the osg-software team since we will not have to build and support our own builds of globus.

Testing the globus from EPEL while installing some packages from osg repos is not a trivial matter.

  1. Disable the priority of the OSG repo
  2. Exclude globus and related packages that are already in EPEL from the osg repo.
Below is my final /etc/yum.repos.d/osg.repo.

Notice the many excludes in the file; the list may not be complete.
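Since the file itself was embedded and is lost here, this is a sketch of its shape.  The repo name, URLs, and exclude list are placeholders, not my real (much longer) list:

```ini
[osg]
name=OSG Software for Enterprise Linux 5
# baseurl/mirrorlist omitted here
enabled=1
gpgcheck=1
# the priority line is removed (or commented out) so EPEL(-testing) can win
#priority=98
# exclude globus and related packages that now come from EPEL
# (illustrative and almost certainly incomplete)
exclude=globus-* myproxy* gsi-openssh*
```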

Installation is just:
yum install osg-client-condor --enablerepo=epel-testing

UPDATE!!!!
Testing Results
very good!

I ran 3 tests, all completely successful.
1. globus-job-run against an RPM-based CE.
$ globus-job-run pf-grid.unl.edu/jobmanager-fork /bin/sh -c "id"
uid=1761(hcc) gid=4001(grid) groups=4001(grid)
2. Condor-G submission
Condor-G submission worked without problems.  The submission file is below:
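The actual submit file was attached as an embed and is lost here; a minimal Condor-G submit file for a GT2 CE looks roughly like this (the executable and file names are illustrative):

```ini
universe      = grid
grid_resource = gt2 pf-grid.unl.edu/jobmanager-fork
executable    = /bin/hostname
transfer_executable = false
output        = test.out
error         = test.err
log           = test.log
queue
```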
3. And globus-url-copy worked:
$ globus-url-copy gsiftp://pf-grid.unl.edu/etc/hosts ./hosts

Friday, January 20, 2012

Initial EL6 Packages for OSG

Last night I completed initial packages for EL6 support.  Just like for EL5, the first OSG component I created is the osg-wn-client.

The osg-wn-client has a complicated dependency tree.  Easily some of the most difficult packages were from glite.

Just some quick tidbits that made the transition easier:

UUID Differences
uuid.h and the associated library are used by many applications.  In el5, uuid is provided by the e2fsprogs package.  In el6, it has its own package, libuuid.  It was common for me to copy this tidbit into a few packages:
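The tidbit itself was an embed that is lost here; it was a spec-file conditional along these lines (a sketch; the exact macro test varied between packages):

```spec
# el5: uuid.h comes from e2fsprogs-devel; el6: it comes from libuuid-devel
%if 0%{?rhel} && 0%{?rhel} <= 5
BuildRequires: e2fsprogs-devel
%else
BuildRequires: libuuid-devel
%endif
```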
gsoap Differences
glite-fts-client and glite-data-delegation-api-c both use gsoap.  In the past, it was common to copy stdsoap2.c from the gsoap distribution and compile it into your program.  Now that gsoap is a regular library, programs should link against the system's version.  To do this, I had to patch the Makefiles of both packages to link against the system's gsoap.


What's next?  
The next step is the osg-client.  Since there are no more glite packages for the osg-client, this step should be easier.

Tuesday, October 25, 2011

September Progress Report

Since September was a while ago, I'll keep this short.  Most of Sept. was spent figuring out the class/work/research schedule.  It had been two years since I'd taken anywhere near a full load of classes, and it's funny how quickly you forget the work a class takes.

OSG Software
During September, I continued to help with the OSG software effort.  By the beginning of September, I had handed most of my tasks off to others, but I still contributed here and there, especially to discussions on the software that I had built.  This was most evident for the Condor build in the OSG repositories.

For the OSG Condor build, I backported the 7.7 builds from the Fedora distribution.  There was some discussion about whether the method I chose to build Condor was the best.  I would argue it is the best of a bad situation.  There is no 'proper' build of Condor for RHEL5.  This is very much on purpose: Red Hat won't allow an EL5 build of Condor in EPEL, since it's distributed in their MRG product.  Also, the RPMs that the Condor team produces do not conform to the OSG software standards.  They are statically linked against a lot of libraries.  They don't have source RPMs.  And they only recently started putting things in the right locations: /etc, /usr/bin...

But, by backporting the Fedora build, we didn't get CREAM support.  We never had CREAM support before in the VDT, but we would still like it, since it's in the 'binary blob' Condor builds.  We would need a properly packaged CREAM client in order to add it.

I'm also not a fan of removing Condor from the osg-client.  Condor has been part of the OSG client since the very beginning.  Condor-G is the base method for submitting globus jobs on almost all systems we use today.  I can imagine a user downloading the osg-client, then wanting to run a job.  But they can't, because they need to install Condor too (which is easy, but still).

SuperComputing Conference Prep
In the month of September, I started developing and forming my idea for the Supercomputing visualization.  If you don't know, I LOVE visualizations, especially when they explain a very complex system in an understandable way.  Self promotion: Youtube channel

The first task for SuperComputing was to add GLOW usage to the google earth display.  While I was at it, I added documentation to the twiki on how to install and run the OSG google earth display.

There were no great ideas on how to expand the current google earth display.  We could add transfers, but we don't have good transfer accounting at a per-VO level.  Plus, not many VOs do file transfers outside of CMS and ATLAS.  Another idea was to incorporate globus online traffic.  But again, I don't think the number of transfers, especially to and from OSG sites, is high enough to show that traffic.  Maybe one day it will be...

So, I turned to something that I had wanted to learn for a while: Android.  In the course of a weekend, I built an OSG Android app.  After demonstrating the app at HCC for a week, I was able to purchase an Android tablet that will be on interactive display at SuperComputing.  The goal of the application is to provide interactive status of the OSG, either at the site level or by VO.

Campus Grids
We got Bill from VT running through the Engage submit host.  He is flocking to the Engage submit host, and then out to the OSG, all while using whole-machine slots.  Additionally, he's flocking to campus factories that are also running whole-machine jobs.  He was very quick to get it set up and then start running actual science through the system.  His usage is currently monitored by gratia (he hasn't been very active lately, so you may need to expand the start and end dates).  We should thank the Renci admins Brad and Jonathan for their help with the setup; they have been super-fast learners of the glideinwms system and, maybe more importantly, willing to experiment!

There were a few issues this month with Gratia collection.  It mostly boils down to misunderstandings about what we are accounting, what is possible (practical) to account (not the same thing), and what counts as an 'OSG job'.

We enabled Florida running with the campus factory.  They had it set up before, but something had changed and caused some held jobs.  It turned out to be harmless, but it's still a sign of how multiple layers can complicate the system.  At least the users only see one layer: Condor.

Cloud Platform
When I got back after spending the summer at FNAL, there were a few changes at HCC.  First, we had a 'cloud' platform that was just starting to take shape.  HCC built an OpenStack prototype on a few nodes.  I helped beta test the cloud, figuring out a few bugs.  I'm happy to report that it is now 'just working'.  It has been especially useful when I want to try out new software quickly; for example, last week I quickly installed the Ceph distributed file system and found it usable.  It also has been very useful for quick install tests of OSG software.

Misc.
Started using the GOC factory.  Very happy to have some fault tolerance on our GlideinWMS install.  Though UCSD has been very stable in practice, it's nice peace of mind.

New user on our glideinwms machine: Monica from Brown University, running some neutrino experiments (I'm not a physicist!).  For now she's running as HCC, but that may change in the future if usage increases.

Various grid administration of the HCC site.  I've been transitioning off of administering most of the HCC resources that I used to maintain, freeing me up mostly for class work.




Friday, October 21, 2011

SC11 Visualization Prep

Last year I put together a visualization for SuperComputing '10 to show jobs moving over the OSG.



This visualization has since played at HCC on the tile wall.  Here's a video from youtube.


I will be showing it again this year at SC11 in Seattle.  I will also be showing my OSG Android tablet application, which is currently available in the Android Market.

Instructions for installing and running this visualization are at:
https://twiki.grid.iu.edu/bin/view/Main/InstallingOSGGoogleEarth


Friday, October 14, 2011

CEPH on Fedora 15

Yesterday, I read a blog post about using CEPH as a backend store for virtual machine images.  I've heard a lot about ceph in the last year, especially after it was integrated into the mainline kernel in 2.6.34.  So I thought I'd give it a try.

Before I get into the install, I want to summarize my thoughts on Ceph.  I think it has a lot of potential, but parts of it try too hard to do everything for you.  There is always a careful balance between a program doing too much for you and making you do too much.  For example, the mkcephfs script that creates a ceph filesystem will ssh to all the worker nodes (defined in ceph.conf) and configure the filesystem.  If I were in operations, this would scare me.

Also, the keychain configuration is overly complicated.  I think Ceph is designed to be secure over the WAN (secure, not encrypted), so maybe it's needed.  But it seems overly complicated when you compare it to other distributed file systems (Hadoop, Lustre).

On the other hand, I really like the fully POSIX-compliant client, especially since it's in the mainline kernel.  It's too bad that it was added in 2.6.34 rather than 2.6.32 (the RHEL 6 kernel).  I guess we'll have to wait two years for RHEL 7 to have it in something we can use in production.

Also, the distributed metadata and multiple metadata servers are interesting aspects of the system.  Though, in the version I tested, the MDS crashed a few times (the system picked it up and compensated).

On Fedora 15, ceph packages are in the repos.
yum install ceph

The configuration I settled on was:
[global]
    auth supported = cephx
    keyring = /etc/ceph/keyring.admin

[mds]
    keyring = /etc/ceph/keyring.$name
[mds.i-00000072]
    host = i-00000072
[mds.i-00000073]
    host = i-00000073
[mds.i-00000074]
    host = i-00000074

[osd]
    osd data = /srv/ceph/osd$id
    osd journal = /srv/ceph/osd$id/journal
    osd journal size = 512
    osd class dir = /usr/lib64/rados-classes
    keyring = /etc/ceph/keyring.$name
[osd0]
    host = i-00000072
[osd1]
    host = i-00000073
[osd2]
    host = i-00000074

[mon]
    mon data = /srv/ceph/mon$id
[mon0]
    host = i-00000072
    mon addr = 10.148.2.147:6789
[mon1]
    host = i-00000073
    mon addr = 10.148.2.148:6789
[mon2]
    host = i-00000074
    mon addr = 10.148.2.149:6789

As you can see from the configuration file, all files are stored in /srv/ceph/...  You will need to make these directories on all your worker nodes.

Next I needed to create a keyring for authentication between the clients, admins, and data servers.  The keyring tool is distributed with Ceph and is called cauthtool.  Even now, it's not clear to me how to use this tool, or how Ceph uses the keyring.  First you need to make a caps (capabilities?) file:

osd = "allow *"
mds = "allow *"
mon = "allow *"

Here are the cauthtool commands to get it to work.

cauthtool --create-keyring /etc/ceph/keyring.bin
cauthtool -c -n i-00000072 --gen-key /etc/ceph/keyring.bin
cauthtool -n i-00000072 --caps caps /etc/ceph/keyring.bin
cauthtool -c -n i-00000073 --gen-key /etc/ceph/keyring.bin
cauthtool -n i-00000073 --caps caps /etc/ceph/keyring.bin
cauthtool -c -n i-00000074 --gen-key /etc/ceph/keyring.bin
cauthtool -n i-00000074 --caps caps /etc/ceph/keyring.bin
cauthtool --gen-key --name=admin /etc/ceph/keyring.admin


From the blog post linked above, I used their script to create the directories and copy the ceph.conf to the other hosts.

n=0
for host in i-00000072 i-00000073 i-00000074 ; \
   do \
       ssh root@$host mkdir -p /etc/ceph /srv/ceph/mon$n /srv/ceph/osd$n; \
       n=$(expr $n + 1); \
       scp /etc/ceph/ceph.conf root@$host:/etc/ceph/ceph.conf
   done
mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.bin


Then copy the keyrings:
for host in i-00000072 i-00000073 i-00000074 ; \
   do \
       scp /etc/ceph/keyring.admin root@$host:/etc/ceph/keyring.admin; \
   done


Then start up the daemons on all the nodes:

service ceph start

And to mount the system:
mount -t ceph 10.148.2.147:/ /mnt/ceph -o name=admin,secret=AQBlV5dO2TICABAA0/FP7m+ru6TJLZaPxFuQyg==

Where the secret is the output from the command:
 cauthtool --print-key /etc/ceph/keyring.bin