CarrollOps

A Blog by Jeremy Carroll

S3 Yum Repos With IAM Authorization

I wanted to get away from some of the Big Data posts I have been doing and focus on the cloud. When I first investigated moving infrastructure to the cloud, I needed software repositories that my instances could download packages from. I currently use CentOS 6.x as the OS for AMI images on EC2. I created the smallest instance I could just to run an Apache server that delivered yum repositories. I then had to manage the security groups, availability, and scalability of that web service. It works, but I wished there was a more hands-off alternative.

About 7 months ago I came upon a yum plugin https://github.com/jbraeuer/yum-s3-plugin that would allow me to use an S3 bucket as a yum repository. It’s a fantastic idea: S3 spans all US regions, I don’t have to pay for a 24/7 instance, and it’s scalable. I’ve used this plugin to great success, but there is one minor inconvenience. For the plugin to work you must configure an access_key + secret_key on the instances you create. I took the approach that I felt was the best compromise between security and ease of use: I created user accounts with Identity and Access Management (IAM) which had very restrictive rights (only s3:GetObject on my yum S3 buckets). I was recently tasked with setting up additional cloud infrastructure, and went back to GitHub to see if any progress had been made with S3 yum plugins. I then discovered a project which does not require access keys to be installed on the instances: https://github.com/seporaitis/yum-s3-iam. Huzzah!

In June, Amazon released a feature where you can assign IAM roles to an instance at creation time http://aws.typepad.com/aws/2012/06/iam-roles-for-ec2-instances-simplified-secure-access-to-aws-service-apis-from-ec2.html. These roles make additional metadata available inside the instance, providing a temporary access_key, secret_key, and token that rotate every few minutes for use with API requests. In this case it allows us to make requests to S3 without installing any keys or performing key management.
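You can see exactly what the plugin gets to work with by querying the instance metadata service from inside a running instance. Here is a minimal sketch in Python (standard library only); the endpoint is Amazon’s documented metadata URL, and the returned JSON contains AccessKeyId, SecretAccessKey, Token, and an Expiration timestamp.

import json
import urllib2

# Base of the EC2 instance metadata service; reachable from inside any instance.
METADATA_URL = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

def get_role_credentials():
    # First request returns the name of the role attached to this instance.
    role_name = urllib2.urlopen(METADATA_URL).read().strip()
    # Second request returns JSON with short-lived credentials for that role.
    creds = json.loads(urllib2.urlopen(METADATA_URL + role_name).read())
    return creds["AccessKeyId"], creds["SecretAccessKey"], creds["Token"]

if __name__ == "__main__":
    print get_role_credentials()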

I have submitted a Pull Request to the original author to fix a bug and add two enhancements to the code https://github.com/seporaitis/yum-s3-iam/pull/4. At the time of this writing I am assuming that you are using the version with my PR applied, or that it has been incorporated into the main branch.

Installation

  • Create IAM Role (e.g. ‘s3access’) and set a policy that gives s3:GetObject permissions to that role.
  • Launch instances with this role assigned.
  • Inside the instance:
    • Copy s3iam.py to /usr/lib/yum-plugins/
    • Copy s3iam.conf to /etc/yum/pluginconf.d/ (a minimal example follows this list)
    • Configure your repository as in the example s3iam.repo file.
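For reference, yum plugin configuration files follow a simple INI format; assuming the project’s s3iam.conf matches the usual convention, it looks roughly like this (yum itself also needs plugins=1 set in /etc/yum.conf for any plugin to load):

# /etc/yum/pluginconf.d/s3iam.conf
[main]
enabled=1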

IAM Role

Inside of the IAM console, choose ‘Create Role’. Give the role a name (such as s3yum_access). When it comes time to create a policy, select ‘Custom Policy’ and use the policy I have defined below. It allows the role to list all S3 buckets and to get objects only from the base bucket and everything beneath it for our custom yum repository. This may not be the most restrictive policy, but it has worked for me. Please replace bucket-name with the name of your S3 bucket.

{
  "Statement": [
    {
      "Action": [
        "s3:ListAllMyBuckets"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::*"
      ]
    },
    {
      "Action": [
        "s3:GetObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::bucket-name"
      ]
    },
    {
      "Action": [
        "s3:GetObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::bucket-name/*"
      ]
    }
  ]
}
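If you would rather script the role creation than click through the console, a sketch like the one below should work with boto 2’s IAM support (the role name, policy name, and bucket name are placeholders; double-check the method names against your boto version):

import json
import boto

ROLE = "s3yum_access"

# Trust policy that lets EC2 instances assume the role.
assume_role_policy = json.dumps({
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }]
})

# Access policy mirroring the JSON above; replace bucket-name with your bucket.
s3_policy = json.dumps({
    "Statement": [
        {"Action": ["s3:ListAllMyBuckets"], "Effect": "Allow",
         "Resource": ["arn:aws:s3:::*"]},
        {"Action": ["s3:GetObject"], "Effect": "Allow",
         "Resource": ["arn:aws:s3:::bucket-name", "arn:aws:s3:::bucket-name/*"]},
    ]
})

iam = boto.connect_iam()
iam.create_role(ROLE, assume_role_policy_document=assume_role_policy)
iam.put_role_policy(ROLE, "s3yum-policy", s3_policy)

# Instances are launched with an instance profile that wraps the role.
iam.create_instance_profile(ROLE)
iam.add_role_to_instance_profile(ROLE, ROLE)

When launching instances, reference the instance profile so the role is attached at creation time (newer boto versions expose this as the instance_profile_name argument to run_instances).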

Creating a Yum repository

I currently use the S3 tools http://s3tools.org/s3cmd package to rsync local files to S3. Create a local repository structure like you normally would, and then sync it to your S3 bucket. If you do not know how to create a yum repository, there is plenty of documentation on Google to help you out. The gist of it is to copy your RPM files into a directory and then run createrepo on it. The createrepo command generates the filelists and repomd.xml metadata that make it a working yum repo. Then sync these directories to S3 however you wish.

Sample Directory layout

noarch
  noarch/cloud-init-0.6.3-0.5.bzr532.el6.noarch.rpm
  noarch/s3-yum-0.1.0-1.noarch.rpm
  noarch/python-boto-2.5.2-1.noarch.rpm
x86_64
  x86_64/python-PyYAML-3.10-1.x86_64.rpm

S3 tools sync

createrepo noarch
createrepo x86_64
s3cmd sync . s3://bucket-name/

Configuring Yum repository

In my pull request I added a feature from the s3-yum python script I was used to. Set s3_enabled=1 on a repository to mark it as an S3 repo instead of a regular one. This allows you to mix and match repositories from the web with private S3 repositories.

[s3-noarch]
name=S3-noarch
baseurl=http://bucket-name.s3.amazonaws.com/noarch
enabled=1
s3_enabled=1
gpgcheck=0

Putting it all together

We now have a configured repository inside of S3 with our RPM packages. We have set up a role to use when launching new EC2 instances. We have granted this role rights via IAM to get objects inside of our yum S3 bucket repository. Finally, we have installed the yum plugin on the AMI instance, which will look up the temporary access keys to use to fetch data from our S3 bucket like a private yum repository. It’s pretty easy to test whether this is working. Run something like this to verify:

yum clean all
yum search <package which only exists in my custom repo>
yum install -y <package which only exists in my custom repo>

Success will rain down upon you, as you have created a scalable package-serving infrastructure for all of your AMIs in any availability zone. Comments / suggestions are welcome.

Hadoop Fair Scheduler Monitoring - Part 2

In my first blog post I went over setting up a simple fair scheduler monitoring script using Mechanize to scrape stats from the JobTracker page. Now that all the metrics are in our Graphite system (in this case, I used Graphite for its nice front-end visualizations), I wanted to show how we can put these metrics to work.

The goal was to measure requests for resources (map slots / reduce slots) for different job queues. I can then determine the parts of the day where there is a lot of contention for resources, to see if a job can be rescheduled to a different time when more slots are available. Another option would be to change the fair-scheduler minShare, preemption, and weight settings to help shape the SLA of running jobs. Before we can start setting these policies, it would be nice to know map / reduce slot demand over a timeline, and also to look at ‘running’ map / reduce slots to see what resources the scheduler gave out during those time periods. Here are some questions the metrics can answer for us.

  • During what times of the day is the cluster heavily over-subscribed?
    • Is the cluster using more map or reduce slots during these timeframes?
  • What queues / users are requesting resources?
    • How many jobs are they running?
    • How many Map / Reduce slots are they requesting?
    • How fast are they completing map / reduce slots?

I put together a series of visualizations to help debug an oversubscribed Hadoop cluster. This information was invaluable in determining the winners and losers of cluster resources. It did not help debug individual MapReduce jobs; for that you would have to look at individual job performance, such as spilling to disk, not using a combiner or secondary sort, or sending too much data uncompressed over the wire. But it does help determine overall cluster utilization.

Demand

The image here represents map + reduce slots requested by running jobs over a 24 hour period. As jobs complete, the total number drops. When you start to see solid straight lines (plateaus) in the image, it means that no new jobs have been added to or removed from the scheduler for this pool; the current jobs are just taking a while to complete. Nice to get a big-picture view of how many tasks the cluster is running, broken down by pool.

Jobs

This graph shows how many jobs are currently in the fair scheduler by pool. It is easy to see bursts of activity from end users. The primary purpose of this is to help tune the maxRunningJobs parameter per queue. It’s also nice to see day / night patterns of activity across processing cycles, and to see if somebody can shift a job a couple of hours to take advantage of lower utilization periods.

Map Slots

This metric shows the number of map slots running at any given time. It’s great for determining which pools are taking the majority of the resources. It also shows if there are long-running jobs taking the majority of the queue for long periods of time. With preemption policies enabled, you can also see slots getting killed to free up space for a pool to meet its minShare / fairShare.

Map Speed

This is one of my favorite graphs from this tool. You can see the total number of map slots requested versus the number of map tasks completed. Basically shows you job progress at a high level, and how fast the tasks are completing. Since this does not take into account new jobs entering and adding tasks, it’s a good high-level gut check.

Maps Outstanding

Using Graphite’s function for taking the difference of two series, this is the same data as the image before, but with maps completed subtracted from the total map slots requested. This gives you the number of remaining map tasks to be processed.
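If you want to pull the same ‘outstanding’ series out of Graphite programmatically, the render API accepts a diffSeries target directly. A quick sketch, where the metric names and Graphite host are made-up examples rather than the names the tool actually emits:

import urllib
import urllib2

GRAPHITE = "http://graphite.example.com"

def maps_outstanding(pool, hours=24):
    # diffSeries subtracts the completed series from the requested series,
    # leaving the number of map tasks still waiting to run.
    target = "diffSeries(hadoop.fairscheduler.%s.maps_total," \
             "hadoop.fairscheduler.%s.maps_completed)" % (pool, pool)
    params = urllib.urlencode({"target": target,
                               "from": "-%dhours" % hours,
                               "format": "json"})
    return urllib2.urlopen("%s/render?%s" % (GRAPHITE, params)).read()

print maps_outstanding("etl")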

Queue

As with most of our jobs, the map side completes quickly but then blocks for long periods of time waiting for data to be shuffled across the network, so we usually have contention on reduce slots, not map slots. Using Graphite’s second Y-axis drawing features, I can look at the number of jobs in a queue versus the number of reduces (completed + total). Flat lines tell me there is a long-running reduce task, or a lot of intermediate output.

Reduce Slots

I currently have a minShare policy on a queue that has a strict SLA. It was nice to see that the share was being met, as shown by the blue line. This queue always gets its slots via preemption. You can also see we have long periods of contention, as 100% of the reduce slots are taken for hours at a time. Using this graph, I am currently tuning the cluster to give up map slots in favor of reduce slots, in addition to increasing the block size since map tasks are completing very quickly.

Queue Info

This is the same graph as the map side, with reduces completed subtracted from the total number of reduce slots requested. Cross-referencing this with our other network graphs shows a lot of intermediate data going across the network.

All in all I have had good success with a very simple metrics collection tool to help me identify some bottlenecks on my cluster setup. This is what I have come up with so far.

  • Increase the number of reduce slots. Checking other metrics shows we have network bandwidth + disk IOPS to spare.
  • Increase the amount of work done by a map task as they are completing very quickly.
  • Shuffle some job start times to take advantage of idle resources.

Let me know if anybody has a chance to use the tool to identify issues with their cluster. I’d love to hear some feedback.

Hadoop Fair Scheduler Monitoring

So today I received a request to help debug fair scheduler performance on one of our large Hadoop clusters. Normally this is where I would suggest installing a tool like Cloudera Manager, but we do not have CM running anywhere within our environment. So I took a quick look around GitHub to see if anybody had written any scripts to monitor the fair scheduler allocations, and found nothing. We currently have monitoring at the Hadoop / HDFS level, but are lacking visibility at the individual scheduler / pool level. Questions like the following arise that I cannot answer without a cool made-up story.

  • My job ran very slow last night, can you take a look at the cluster to see if anything is wrong?
  • This set of Oozie jobs normally takes 2 hours, but over the past few days it has been taking 4 hours each run, why is that?
  • Can you check the network? Looks like my jobs have been running very slowly.

Now at this point, I check the regular cluster health with our fleet of monitoring tools based on Graphite / OpenTSDB. Looking at HDFS health, I see no instances of failed datanodes, 100 mbit links, errors in the log files, et al. While looking at basic hadoop-metrics from our logging context, I see that the cluster is almost always 100% utilized on map slots / reduce slots. The next obvious question is ‘Who is running the jobs stealing my resources?’. Before I can start to create fair-share policies, and pick winners and losers of precious cluster resources via preemption, I need to know demand. I need to know how many jobs are running in each pool, and how many slots they require to finish their tasks. I would also love to monitor preemption requests to see when pools start killing other tasks to meet their fair-share or min-share. I set out to create a simple tool to query the http://jobtracker:50030/scheduler?advanced web interface on a timer, and send this metadata to Graphite / OpenTSDB on a real-time basis. I could then create visualizations to see demand and allocation to help craft fair share policies. It’s not perfect, as it does not look at individual job performance (input splits, min / max task completion time), but it is really helpful at a high level.

I wanted something quick and easy to get done, so I took about 2 hours to create a Ruby Mechanize script to screen scrape the JobTracker fair-scheduler page. I then turn the output into a KeyValue format that I can use with OpenTSDB / Graphite. I did not want to create a lot of unique keys due to issues with Graphite creating a Whisper database per unique point, so capturing job-name or task-id is unacceptable. I instead aggregate metrics by user or pool. In the case of our cluster, job pool names are user names. So I aggregate the metrics by pool name, then send them off to our visualization system for further planning.
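The real script is Ruby + Mechanize and hands its key / value output to Diamond (more on that below), but the shape of the idea is easy to sketch in Python. This version sends the aggregated per-pool numbers straight to Carbon’s plaintext port instead of going through Diamond; the metric names, pool names, and Graphite host are illustrative assumptions, not what the tool actually emits:

import socket
import time

CARBON = ("graphite.example.com", 2003)  # Graphite's plaintext listener

def send_pool_metrics(pool_stats):
    # pool_stats: {"etl": {"running_jobs": 4, "map_slots": 120}, ...}
    now = int(time.time())
    sock = socket.create_connection(CARBON)
    try:
        for pool, stats in pool_stats.items():
            for key, value in stats.items():
                # Plaintext protocol: "metric.path value timestamp\n"
                sock.sendall("hadoop.fairscheduler.%s.%s %s %d\n"
                             % (pool, key, value, now))
    finally:
        sock.close()

# Numbers like these would come from scraping
# http://jobtracker:50030/scheduler?advanced
send_pool_metrics({
    "etl": {"running_jobs": 4, "map_slots": 120, "reduce_slots": 30},
    "adhoc": {"running_jobs": 1, "map_slots": 8, "reduce_slots": 2},
})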

I created a project above if you would like to hack on the code. Currently I am using Diamond https://github.com/BrightcoveOS/Diamond to schedule checks via the UserScriptsCollector for ruby programs. It wants key + value output and fills in the timestamp for you based on your schedule. I can then send the metrics to both OpenTSDB + Graphite with the same system via Handlers. Below are some graphs of the things you can do with this information.

As you can see from the graph, it gives a nice breakdown of slot utilization by fair scheduler pool. In this image, 3 different pools are racing for resources as the cluster is 100% utilized. You can dive into other metrics such as total tasks scheduled (map / reduce) vs. resources available at that time. Let me know if you find this tool helpful. Pull requests are always welcome.

Open Sourcing Some Work

I’ve made it a point to start contributing more of my work back to the open source community I’ve gained so much from. So in the next few days, with permission from my employer, I am going to start contributing some simple tools that have made my life easier over here at Klout. Here is a list of some of the things I’m working on.

  • Storm Puppet module
  • Storm debian packaging
  • HBase rolling compact tool with ZooKeeper locks
  • HBase major_compaction script with ZooKeeper locks
  • HBase metrics collection script
  • HBase per-table load balancing calculation script
  • Hadoop GraphiteClient context
  • Kafka JMX Monitoring script
  • Diamond handler for Kafka
  • OpenTSDB GnuPlot options
  • MCollective Monit agent
  • Hadoop hiera puppet module
  • HBase hiera puppet module