Match Containers to Host Processes

During my presentation Securing Container Workloads on AWS Fargate, I built a demo environment where I could build and run various containers and show the effect they had on the host. While my demo went well, a key piece of feedback is that customers liked how I presented the demo environment by having containers and their host processes on one side. To that end, I’ll show you.

Containers Pane

To show the currently running containers on a given host, use docker ps. The normal format (for v18.09.1) looks like:

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
525a7b49ef67        nginx               "nginx -g 'daemon of…"   About an hour ago   Up About an hour    80/tcp                   tender_shirley

However, for this demo, I was only concerned with the name, image, command, and current status (which has the time it’s been running), so I formatted the output using the --format flag, and stuck it inside watch to update every second.

Command
watch -n 1 "docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Command}}\t{{.Status}}'"
Output
Every 1.0s: docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Command}}\t{{.Status}}' localhost.localdomain: Sat Feb 23 13:44:58 2019 NAMES IMAGE COMMAND STATUS tender_shirley nginx "nginx -g 'daemon of " Up About an hour

Host Processes Pane

Getting the host processes (and a way to map them to containers) was more difficult. The best tool in Linux for looking at processes is ps (which is where Docker gets the name for docker ps), but this doesn’t give us all the information about a container.

When a container starts, it spawns as a process with a specific process identifier (PID) in the host, but the container sees the PID as 1. This process can also spawn other processes, which will reference ther parent process PPID. Subprocesses will show up with a PPID of the main process PID but inside the container as PPID 1. For my demo, I wanted to show both the processes and subprocesses at the host, and include information about the user running each process.

Thus, I built a script called watchpids.sh. This script gathered the host PIDs, found all of the children PIDs and then fed the list of PIDs to ps, also formatting the list to show the running time of the process, the PID, the PPID, the user associated with the process, and the command run. Again, execution of the script was wrapped in watch.

Script
#!/bin/bash
pidlist() {
local thispid=$1
local fulllist=
local childlist=
childlist=$(ps --ppid $thispid -o pid h)
for pid in $childlist
do
fulllist="$(pidlist $pid) $fulllist"
done
echo "$thispid $fulllist"
}
dockerpids() {
local fulllist=
local dockerpids="$(pidof docker-containerd-shim)"
for p in ${dockerpids}
do
for q in $(ps --ppid $p -o pid h)
do
fulllist="$(pidlist $q) $fulllist"
done
done
echo "$fulllist"
}
ary=($(dockerpids))
DOCKER_PIDS=${ary[@]}
ps -Ho etime,pid,ppid,user,cmd --pid ${DOCKER_PIDS// +/,} 2> /dev/null
view raw watchpids.sh hosted with ❤ by GitHub

With both the containers and processes displayed, map the container STATUS to the host process ELAPSED time to see what processes show up on the host whenever a new container is started.

Terminal Window

Tying it all together, I used tmux to build the container and host process panes on the right, and an area to type commands on the left.

tmux uses either keyboard shortcuts or commands inside the session to change the environment–going for a “scripting” approach, I chose the latter.

Commands
tmux new-session -d -s builder_demo
tmux split-window -h
tmux split-window -dv "watch -n 1 \"docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Command}}\t{{.Status}}'\""
tmux select-pane -t 0
tmux send-keys -t 1 'watch -n 1 ./watchpids.sh' C-m
tmux -2 attach-session -d
Screenshot

Securing Container Workloads on AWS Fargate

When containers first became mainstream (think PyCon 2013 with Solomon Hykes on stage), everyone thought it had potential and began to test running containers on their own, but almost no one set out to put containers in production that day. They wanted to see it battle-tested…which has happened over time. Containers have matured from an emerging technology to production-ready where it’s generally considered safe, but there’s a new problem. Now, we need our business processes, tools, and architecture models to mature as well.

The top ask I hear about containers comes down to security. Containers were built as a way to isolate workloads for one another, but many of the security models from virtual machines do not work in containers, and thus we must evolve our thought process.

To that end, I presented an AWS Online Tech Talk about how to secure container workloads using AWS Fargate (though many of the lessons also apply generally across containers). I demonstrated some quick steps to make your containers more secure during the build process as well as how to enhance visibility and security around containers running in AWS Fargate.

What’s Your Exit Strategy?

Why are we afraid of “lock in”? Typically we hear the term and automatically assume it’s bad. It certainly can be, but doesn’t mean that every situation you’re in is a bad one.

On February 8, 2019, I gave an Ignite talk regarding Exit Strategies and “lock in” at DevOpsDays Charlotte. We broke down “lock in” and the varying degrees of it, then talked about how you can use it to your advantage by having an Exit Strategy (which is exactly as it sounds).

“lock in” isn’t exclusive to technology–what about your current employer?

Automatically Deploy Hugo Blog to Amazon S3

I had grand aspirations of maintaining a personal blog on a weekly basis, but sometimes that isn’t always possible. I’ve been using my iPad and Working Copy to write posts, but had to use my regular computer to build and publish. CI/CD pipelines help, but I couldn’t find the right security and cost optimizations for my use case…until this year.

My prior model had my blog stored on GitLab because it enabled a free private repository (mainly to hide drafts and future posts). I was building locally using a Docker container and then uploading to Amazon S3 via a script.

At the beginning of the year, GitHub announced free private repositories (for up to 3 contributors), and I promptly moved my repo to GitHub. (NOTE: I don’t use CodeCommit because it’s more difficult to plumb together with Working Copy.)

I was able to now plumb CodePipeline and CodeBuild to build my site, but fell short on deploying my blog to S3. I had to build a Lambda function to extract the artifact and upload to S3. The function is only 20 lines of Python, so it wasn’t difficult.

But then, AWS announced deploying to S3 from CodePipeline, meaning my Lambda function was useful for exactly 10 days!

Now, I can write a post from my iPad and publish it to my blog with a simple commit on the iPad! It’s a good start to 2019, with (hopefully) more topics coming soon…

Rotate IAM Access Keys

How often do you change your password?

Within AWS is a service called Trusted Advisor. Trusted Advisor runs checks in an AWS account looking for best practices around Cost Optimization, Fault Tolerance, Performance, and Security.

In the Security section, there’s a check (Business and Enterprise Support only) for the age of an Access Key attached to an IAM user. The Trusted Advisor check that will warn for any key older than 90 days and alert for any key older than 2 years. AWS recommends rotating the access keys for each IAM user in the account.

From Trusted Advisor Best Practices (Checks):

Checks for active IAM access keys that have not been rotated in the last 90 days. When you rotate your access keys regularly, you reduce the chance that a compromised key could be used without your knowledge to access resources. For the purposes of this check, the last rotation date and time is when the access key was created or most recently activated. The access key number and date come from the access_key_1_last_rotated and access_key_2_last_rotated information in the most recent IAM credential report.

The reason for these times is the mean time to crack an access key. Using today’s standard processing unit, and AWS Access Key could take xxx to crack, and users should rotate their Access Key before that time.

Yet in my experience, this often goes unchecked. I’ve come across an Access Key that was 4.5 years old! I asked why not change it, and the answer is mostly the same–the AWS Administrators and Security teams do not own and manage the credential, and the user doesn’t want to change the credential for fear it will break their process.

Rotating an AWS Access Key is not difficult. It’s a few simple commands to the AWS CLI (which you presumably have installed if you have an Access Key).

  1. Create a new access key (CreateAccessKey API)
  2. Configure AWS CLI to use the new access key (aws configure)
  3. Disable the old access key (UpdateAccessKey API)
  4. Delete the old access key (DeleteAccessKey API)

Instead of requiring each user to remember the correct API calls and parameters to each, I’ve created a script in buzzsurfr/aws-utils called rotate_access_key.py that orchestrates the process. Written in Python (a dependency of AWS CLI, so again should be present), the script minimizes the number of parameters and removes the undifferentiated heavy lifting associated with selecting the correct key. The user’s access is confirmed to be stable by using the new access key to remove the old access key. The script can be scheduled using crown or Scheduled Tasks and supports CLI profiles.

usage: rotate_access_key.py [-h] --user-name USER_NAME
                            [--access-key-id ACCESS_KEY_ID]
                            [--profile PROFILE] [--delete] [--no-delete]
                            [--verbose]

optional arguments:
  -h, --help            show this help message and exit
  --user-name USER_NAME
                        UserName of the AWS user
  --access-key-id ACCESS_KEY_ID
                        Specific Access Key to replace
  --profile PROFILE     Local profile
  --delete              Delete old access key after inactivating (Default)
  --no-delete           Do not delete old access key after inactivating
  --verbose             Verbose

In order to use the script, the user must have the right set of permissions for their IAM user. This template is an example and only grants the IAM user permissions to change their own access Key.

From IAM: Allows IAM Users to Rotate Their Own Credentials Programmatically and in the Console:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:ListUsers",
                "iam:GetAccountPasswordPolicy"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:*AccessKey*",
                "iam:ChangePassword",
                "iam:GetUser",
                "iam:*ServiceSpecificCredential*",
                "iam:*SigningCertificate*"
            ],
            "Resource": ["arn:aws:iam::*:user/${aws:username}"]
        }
    ]
}

This script is designed for users to rotate their credentials. This does not apply for “service accounts” (where the credential is configured on a server or unattended machine). If the machine is an EC2 Instance or ECS Task, then attaching an IAM Role to the instance or task will automatically handle rotating the credential. If the machine is on-premise or hosted elsewhere, then adapt the script to work unattended (I’ve thought about coding it as well).

As an AWS Administrator, you cant simply pass out the script and expect all users to rotate their access keys on time. Remember to build the system around it. Periodically query the TA check looking for access keys older than 90 days (warned), and send that user a reminder to rotate their access key. Take it a step further by automatically disabling access keys older than 120 days (warn them in the reminder). Help create good security posture and a good experience for your users, and make your account more secure at the same time!

F5 Archive

Back in 2013, I led a “proof of concept” test for an enterprise-grade load balancing solution. We evaluated many products, but had a shortlist of 4 vendors, and ultimately selected F5 Networks. While the selection criteria was different, I personally liked F5’s extensibility. I continued to work with F5 for a few years, earning my professional-level certification and engaging with the DevCentral community.

Management API

While many network professionals grew up on CLI-based tools, at that time I knew the importance of having an API for managing devices. While CLI-based tools work, they offer very little in programmability and orchestration. Any orchestrated solution using a CLI has to account for the various ways of connecting to the CLI–which are always subject to change by the vendor. APIs offer a standard interface for connecting to and managing a device, and are often themselves extended by a provided CLI or SDK that communicates with the API.

F5’s original “iControl” API was a SOAP-based API. Anyone who wrote a SOAP API call knows why they stopped, but F5 also provided bigsuds, a Python library that would call the API. Bigsuds made it easy to programmatically connect to any F5 and accomplish any goal.

I created a set of bigsuds scripts and published them to buzzsurfr/f5-bigsuds-utils and DevCentral. They range from connecting to the active device in a HA pair to locating orphaned iRules (iRules not associated with a Virtual Server) to finding a Pool/Virtual Server based on a Node IP Address.

In 2013, F5 also released their first version of iControlREST, a REST-based API, and the f5-sdk, which offered a cleaner interface and object-oriented code for maintaining a F5 device. I converted some of my scripts to use the f5-sdk and again pushed them to buzzsurfr/f5-sdk-utils and DevCentral.

Programmable Logic

Hardware vendors have historically struggled with keeping up with the pace of innovation in technology. One time, we were evaluating a core network refresh. Instead of discussing what the products can do, we spent more time discussing what they will do in the future. I recall a colleague asking all the major vendors when they would support TRILL (don’t judge {{< emoji “:smiley:” >}}). Almost always, the answer required new hardware, and it would be no sooner than 18 months.

While I understand the need to put this type of logic directly into hardware, why not have a stopgap? Put a process in place to code the feature in software, then promote it to hardware at a later date. F5 was the first time I saw this business model, and I was immediately drawn to it. If the F5 didn’t have a feature I needed, then I just wrote the logic in an iRule. iRules take my logic and process it as part of the F5’s routing logic. Suddenly, I stopped asking my F5 representatives about when a feature would release and instead on how I could program that feature myself.

F5’s come with preloaded iRules, but I had to create my own over time, and collected them in buzzsurfr/f5-iRules. A few examples:

  • One time, I had a customer with a broken app that would intermittently respond with multiple Content-Length headers (which breaks RFC 2616). They weren’t sure why, but it needed to be fixed. We fixed it with an iRule until they could find and resolve the bug in the application. This wasn’t a load balancing problem, but we still used the load balancer to workaround the problem and remove customer pain.
  • I had a need to implement Content Switching, which wasn’t supported by F5 at the time. With iRules, I was able to create content switching at both the host and path until the F5 supported content switching.

I don’t spend much time with F5 products these days, but I still use the programmable logic model. In my current role at AWS, I often find gaps in features that are needed by my customers, and many times we’re able to develop a Lambda function to fill the gap until the feature is released. I’ve watched this same model serve both F5 Networks and AWS well, and I hope the trend continues with other products as we continue to evolve.

My F5-based repositories

Docker Hugo

After restarting my blog, I wanted a way to automate my workflow. I currently work for AWS, and want to use the features of the cloud to manage and deploy my blog, but for as little cost as possible. The lowest cost for a static site like mine is Amazon S3, which offers to host the objects in the bucket as a static website.

This starts by adopting a solid framework for building static sites. After trying a few, I selected Hugo. I had been using mkdocs for training/tutorials but felt it lacked a good native layout engine and wasn’t a good fit for a blog.

I followed the installation instructions, but wanted something I could containerize (since it’s relevant to my current work). Thus, I created docker-hugo as a simple project to containerize hugo.

For now, this includes a README and a Dockerfile (copied as of August 3, 2018):

FROM centos:latest as builder

RUN yum -y update
RUN curl -sL -o hugo.tar.gz https://github.com/gohugoio/hugo/releases/download/v0.46/hugo_0.46_Linux-64bit.tar.gz && tar zxf hugo.tar.gz hugo

FROM scratch

COPY --from=builder /hugo .

VOLUME /host
WORKDIR /host
EXPOSE 1313

ENTRYPOINT ["/hugo"]

While a simple example, it does combine some newer Docker features. I used a multi-stage build to download the actual binary, then a scratch image for the actual deployment. The README highlights the syntax I use for the command, and an alias for being able to run hugo new posts/docker-hugo.md with all of my environment variables already plugged in. This can also be adapted for a future CI/CD process.

Add Athena Partition for ELB Access Logs

If you’ve worked on a load balancer, then at some point you’ve been witness to the load balancer taking the blame for an application problem (like a rite of passage). This used to be difficult to exonerate, but with AWS Elastic Load Balancing you can capture Access Logs (Classic and Application only) and very quickly identify whether the load balancer contributed to the problem.

Much like any log analysis, the volume of logs and frequency of access are key to identify the best log analysis solution. If you have a large store of logs but infrequently access them, then a low-cost option is Amazon Athena. Athena enables you to run SQL-based queries against your data in S3 without an ETL process. The data is durable and you only pay for the volume of data scanned per query. AWS also includes documentation and templates for querying Classic Load Balancer logs and Application Load Balancer logs.

This is a great model, but with a potential flaw–as the data set grows in size, the queries become slower and more expensive. To remediate, Amazon Athena allows you to partition your data. This restricts the amount of data scanned, thus lowering costs and increasing speed of the query.

ELB Access Logs store the logs in S3 using the following format:

s3://bucket[/prefix]/AWSLogs/{{AccountId}}}}/elasticloadbalancing/{{region}}/{{yyyy}}/{{mm}}/{{dd}}/{{AccountId}}_elasticloadbalancing_{{region}}_{{load-balancer-name}}_{{end-time}}_{{ip-address}}_{{random-string}}.log

Since the prefix does not pre-define partitions, the partitions must be created manually. Instead of creating partitions ad-hoc, create a CloudWatch Scheduled Event that runs daily targeted at a Lambda function that adds the partition. To simplify the process, I created buzzsurfr/athena-add-partition.

This project is both the Lambda function code and a CloudFormation template to deploy the Lambda function and the CloudWatch Scheduled Event. Logs are sent from the Load Balancer into a S3 bucket. Daily, the CloudWatch Scheduled Event will invoke the Lambda function to add a partition to the Athena table.

Using the partitions requires modifying the SQL query used in the Athena console. Consider the basic query to return all records: SELECT * FROM logs.elb_logs. Add/append to a WHERE clause including the partition keys with values. For example, to query only the records for July 31, 2018, run:

SELECT *
FROM logs.elb_logs
WHERE
  (
    year = '2018' AND
    month = '07' AND
    day = '31'
  )

This query with partitions enabled restricts Athena to only scanning

s3://bucket/prefix/AWSLogs/{{AccountId}}/elasticloadbalancing/{{region}}/2018/07/31/

instead of

s3://bucket/prefix/AWSLogs/{{AccountId}}/elasticloadbalancing/{{region}}/

resulting in a significant reduction in cost and processing time.

Using partitions also makes it easier to enable other Storage Classes like Infrequent Access, where you pay less to store but pay more to access. Without partitions, every query would scan the bucket/prefix and potentially cost more due to the access cost for objects with Infrequent Access storage class.

This model can be applied to other logs stored in S3 that do not have pre-defined partitions, such as CloudTrail logsCloudFront logs, or for other applications that export logs to S3, but don’t allow modifications to the organizational structure.

Blog Restart

It’s been over 10 years since I had a blog, or at least maintained one. I want to promote my personal brand but have often not put forth the effort. I have a significant amount of experience, so it’s just a matter of putting my experiences down “on paper”…and having the right tool to publish.

Enter Hugo. I’ve been a fan of Markdown for awhile, and make avid use of it for projects on GitHub or written for mkdocs. I wanted something that could deploy to a static site since my actual code rarely changes, and to save overall costs. My current site is built on GitHub Pages, but does not allow me the necessary capabilities, and I wanted something similar to mkdocs but that I could easily deploy a scaffold and work.

While I often travel with a laptop, I’ve also been looking at my mobile productivity, and I feel that I could accomplish more by using my mobile device. When I have an idea, I want to commit quickly. My tablet is an easy way to do so since it takes up less room and has less time to boot, but has lacked a sufficient productivity tool.

I’ve typed up this post partially using Working Copy. For me, it has the right blend of git integration and file editor (with Markdown syntax highlighting). You can’t push without the in-app purchase, but the free version plus a 10-day trial lets you test before buying, which let me make sure it fits my workflow.

For my blog content, I plan to document my experiences through my IT journey in hopes that it will also help others. I’ve always embraced the IT community, and a blog is my latest way of giving back. I’ve always struggled with trying to get the best structure and methods before pushing something new, and that’s always led to me never launching. This time, I’m accepting that the blog may not be perfect, but it’s out there and functional. I’ll be able to make improvements over time and grow this into a resource for all.