Humble Fail: Is tech DIY worth it?

What happened?

I run a splash site to easily link to all my profile sites (think Linktree, but I own it). The site is built on Astro and uses Tailwind CSS and Astro Icon.

I was in the process of adding new profile links for D&D Beyond and Roll20. I could run the development server locally and everything worked, but the public site wasn’t updating. I went to check my GitHub Actions logs and found this message:

Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3, actions/setup-node@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.

I followed the link and read on; the change was as simple as updating my workflow to use v4 of those actions and specifying Node.js 20. I pushed the change, only to find that the build broke again. This time it was the astro-icon module. I was running v0.8.1, so I upgraded to v1.1.0 (following the upgrade instructions), but it kept failing. I spent hours searching for a fix and was finally about to file a bug report on astro-icon when I came across their issue template, which reads:

✅ I am using the latest version of Astro Icon.
✅ Astro Icon has been added to my astro.config.mjs file as an integration.
✅ I have installed the corresponding @iconify-json/* packages.
✅ I am using the latest version of Astro and all plugins.
✅ I am using a version of Node that Astro supports (>=18.14.1)

Source: https://github.com/natemoo-re/astro-icon/blob/main/.github/ISSUE_TEMPLATE/bug.yml

I’m typically quick to dismiss these, but my days in support meant I had to run through each of them. I got down to “the latest version of Astro”. I was running v2.9.7, which sounds like the latest version. Out of curiosity, what was the latest version?

v4.3.2 🤬

Sure enough, upgrading fixed it.

What did you learn?

There were a few takeaways from my Saturday morning shenanigans:

  1. “Build over buy” still has consequences. I didn’t want to pay for Linktree’s subscription, thinking I could maintain the site more cheaply, and I also wanted the level of customization I plan to add. I still think that’s a good choice, but had I used Linktree, I wouldn’t have lost this Saturday morning. I chose “build over buy”, and got burned 🔥 (a little).
  2. “Sharpening the saw” only works if you keep up with it. I chose Astro after seeing someone else’s site built that way (sorry, I don’t remember whose). I liked the simple use of tags to convey intent, with the repeated design kept in a separate file. I don’t use Astro for anything else, and that directly contributed to how long I spent on this issue. I feel I would have solved it almost immediately had I spent more than 20 minutes every 6 months using Astro.
  3. Corporations go through this on a much larger scale. My problem is IDENTICAL to that of major corporations that invest in agile development, then ignore the practices. If I had spent more time working on this splash site, I would’ve kept it updated and built up the experience to know to check for the latest version. Because I didn’t, I spent a lot of time trying to figure out why it broke.
  4. Good community hygiene works. I avoided filing an unnecessary issue on a project because of Astro Icon’s issue template. I don’t see good issue templates often, but this one was concise and direct…and showed me the problem.

Ultimately, this problem got me thinking about whether DIY in tech is worth it. I don’t think I considered troubleshooting time when I decided to “build”, but I still like the end result and will continue to build my splash site. I’ve also gone back and forth on this blog between Hugo and WordPress (and different vendors). The key is knowing and understanding the tradeoffs, then being able to move when the need arises.

Automatically Deploy Hugo Blog to Amazon S3

I had grand aspirations of maintaining a personal blog on a weekly basis, but that isn’t always possible. I’ve been using my iPad and Working Copy to write posts, but had to use my regular computer to build and publish. CI/CD pipelines help, but I couldn’t find the right security and cost optimizations for my use case…until this year.

My prior model had my blog stored on GitLab because it offered a free private repository (mainly to hide drafts and future posts). I was building locally using a Docker container and then uploading to Amazon S3 via a script.

At the beginning of the year, GitHub announced free private repositories (for up to 3 contributors), and I promptly moved my repo to GitHub. (NOTE: I don’t use CodeCommit because it’s more difficult to plumb together with Working Copy.)

I was now able to plumb together CodePipeline and CodeBuild to build my site, but fell short on deploying my blog to S3. I had to build a Lambda function to extract the build artifact and upload it to S3. The function is only 20 lines of Python, so it wasn’t difficult.
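
I no longer use that function (more on that below), but a minimal sketch of the idea looks something like this, assuming the pipeline passes the built site as the job’s input artifact; the website bucket name and the content-type guessing are my own placeholders, not the original code:

import io
import mimetypes
import zipfile

import boto3

s3 = boto3.client("s3")
codepipeline = boto3.client("codepipeline")

SITE_BUCKET = "example-blog-bucket"  # placeholder: the S3 static website bucket


def handler(event, context):
    job = event["CodePipeline.job"]
    location = job["data"]["inputArtifacts"][0]["location"]["s3Location"]

    # Download the build artifact (a zip produced by CodeBuild)
    artifact = s3.get_object(Bucket=location["bucketName"], Key=location["objectKey"])
    archive = zipfile.ZipFile(io.BytesIO(artifact["Body"].read()))

    # Unpack each file into the website bucket with a best-guess Content-Type
    for name in archive.namelist():
        content_type = mimetypes.guess_type(name)[0] or "binary/octet-stream"
        s3.put_object(
            Bucket=SITE_BUCKET,
            Key=name,
            Body=archive.read(name),
            ContentType=content_type,
        )

    # Report success back to CodePipeline so the stage completes
    codepipeline.put_job_success_result(jobId=job["id"])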

But then, AWS announced deploying to S3 from CodePipeline, meaning my Lambda function was useful for exactly 10 days!

Now, I can write a post from my iPad and publish it to my blog with a simple commit on the iPad! It’s a good start to 2019, with (hopefully) more topics coming soon…

Rotate IAM Access Keys

How often do you change your password?

Within AWS is a service called Trusted Advisor. Trusted Advisor runs checks in an AWS account looking for best practices around Cost Optimization, Fault Tolerance, Performance, and Security.

In the Security section, there’s a check (Business and Enterprise Support only) for the age of an Access Key attached to an IAM user. The Trusted Advisor check warns for any key older than 90 days and alerts for any key older than 2 years. AWS recommends rotating the access keys for each IAM user in the account.

From Trusted Advisor Best Practices (Checks):

Checks for active IAM access keys that have not been rotated in the last 90 days. When you rotate your access keys regularly, you reduce the chance that a compromised key could be used without your knowledge to access resources. For the purposes of this check, the last rotation date and time is when the access key was created or most recently activated. The access key number and date come from the access_key_1_last_rotated and access_key_2_last_rotated information in the most recent IAM credential report.

The reason for these thresholds is the mean time to crack an access key. Using today’s standard processing power, an AWS Access Key could take xxx to crack, and users should rotate their Access Key before that time.

Yet in my experience, this often goes unchecked. I’ve come across an Access Key that was 4.5 years old! I asked why it hadn’t been changed, and the answer is mostly the same: the AWS Administrators and Security teams do not own and manage the credential, and the user doesn’t want to change the credential for fear it will break their process.

Rotating an AWS Access Key is not difficult. It’s a few simple commands to the AWS CLI (which you presumably have installed if you have an Access Key).

  1. Create a new access key (CreateAccessKey API)
  2. Configure AWS CLI to use the new access key (aws configure)
  3. Disable the old access key (UpdateAccessKey API)
  4. Delete the old access key (DeleteAccessKey API)
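
For anyone curious what those steps look like in code, here’s a minimal boto3 sketch of the same sequence (the user name and old key ID are placeholders; step 2 is normally aws configure on your workstation, represented here by building a client with the new key so the remaining calls prove it works):

import time

import boto3

USER_NAME = "my-user"             # placeholder IAM user name
OLD_KEY_ID = "AKIAOLDKEYEXAMPLE"  # placeholder: the key being rotated out

iam = boto3.client("iam")

# 1. Create a new access key
new_key = iam.create_access_key(UserName=USER_NAME)["AccessKey"]

# IAM is eventually consistent; give the new key a moment to propagate
time.sleep(10)

# 2. Locally this is `aws configure`; here we build a client with the new
#    credentials so the remaining calls confirm the new key actually works
new_iam = boto3.client(
    "iam",
    aws_access_key_id=new_key["AccessKeyId"],
    aws_secret_access_key=new_key["SecretAccessKey"],
)

# 3. Disable the old access key using the new credentials
new_iam.update_access_key(UserName=USER_NAME, AccessKeyId=OLD_KEY_ID, Status="Inactive")

# 4. Delete the old access key
new_iam.delete_access_key(UserName=USER_NAME, AccessKeyId=OLD_KEY_ID)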

Instead of requiring each user to remember the correct API calls and parameters for each step, I’ve created a script in buzzsurfr/aws-utils called rotate_access_key.py that orchestrates the process. Written in Python (a dependency of the AWS CLI, so again it should be present), the script minimizes the number of parameters and removes the undifferentiated heavy lifting associated with selecting the correct key. The user’s access is confirmed to be stable by using the new access key to remove the old access key. The script can be scheduled using cron or Scheduled Tasks and supports CLI profiles.

usage: rotate_access_key.py [-h] --user-name USER_NAME
                            [--access-key-id ACCESS_KEY_ID]
                            [--profile PROFILE] [--delete] [--no-delete]
                            [--verbose]

optional arguments:
  -h, --help            show this help message and exit
  --user-name USER_NAME
                        UserName of the AWS user
  --access-key-id ACCESS_KEY_ID
                        Specific Access Key to replace
  --profile PROFILE     Local profile
  --delete              Delete old access key after inactivating (Default)
  --no-delete           Do not delete old access key after inactivating
  --verbose             Verbose

In order to use the script, the user must have the right set of permissions on their IAM user. This template is an example and only grants the IAM user permission to manage their own access keys.

From IAM: Allows IAM Users to Rotate Their Own Credentials Programmatically and in the Console:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:ListUsers",
                "iam:GetAccountPasswordPolicy"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:*AccessKey*",
                "iam:ChangePassword",
                "iam:GetUser",
                "iam:*ServiceSpecificCredential*",
                "iam:*SigningCertificate*"
            ],
            "Resource": ["arn:aws:iam::*:user/${aws:username}"]
        }
    ]
}

This script is designed for users to rotate their own credentials. It does not apply to “service accounts” (where the credential is configured on a server or unattended machine). If the machine is an EC2 Instance or ECS Task, then attaching an IAM Role to the instance or task will automatically handle rotating the credential. If the machine is on-premises or hosted elsewhere, then adapt the script to work unattended (I’ve thought about coding it as well).

As an AWS Administrator, you can’t simply pass out the script and expect all users to rotate their access keys on time. Remember to build the system around it. Periodically query the Trusted Advisor check looking for access keys older than 90 days (warned), and send those users a reminder to rotate their access keys. Take it a step further by automatically disabling access keys older than 120 days (warn them in the reminder). Help create a good security posture and a good experience for your users, and make your account more secure at the same time!
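
As a sketch of what that periodic check might look like (this walks the IAM credential report, the same data the Trusted Advisor check reads, rather than calling the Trusted Advisor API; the reminder itself is left as a placeholder print):

import csv
import io
import time
from datetime import datetime, timezone

import boto3

MAX_AGE_DAYS = 90  # matches the Trusted Advisor warning threshold

iam = boto3.client("iam")

# Generate the IAM credential report and wait until it's ready
while iam.generate_credential_report()["State"] != "COMPLETE":
    time.sleep(2)
report = iam.get_credential_report()["Content"].decode("utf-8")

now = datetime.now(timezone.utc)
for row in csv.DictReader(io.StringIO(report)):
    for field in ("access_key_1_last_rotated", "access_key_2_last_rotated"):
        rotated = row.get(field, "N/A")
        if rotated in ("N/A", "not_supported"):
            continue
        age = (now - datetime.fromisoformat(rotated.replace("Z", "+00:00"))).days
        if age > MAX_AGE_DAYS:
            # Placeholder action: send the reminder via SNS, email, chat, etc.
            print(f"{row['user']}: {field} is {age} days old -- time to rotate")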

F5 Archive

Back in 2013, I led a “proof of concept” test for an enterprise-grade load balancing solution. We evaluated many products, but had a shortlist of 4 vendors, and ultimately selected F5 Networks. While the selection criteria were different, I personally liked F5’s extensibility. I continued to work with F5 for a few years, earning my professional-level certification and engaging with the DevCentral community.

Management API

While many network professionals grew up on CLI-based tools, even at that time I knew the importance of having an API for managing devices. CLI-based tools work, but they offer very little in the way of programmability and orchestration. Any orchestrated solution using a CLI has to account for the various ways of connecting to the CLI, which are always subject to change by the vendor. APIs offer a standard interface for connecting to and managing a device, and are often themselves extended by a provided CLI or SDK that communicates with the API.

F5’s original “iControl” API was a SOAP-based API. Anyone who has written a SOAP API call knows why they stopped, but F5 also provided bigsuds, a Python library for calling the API. Bigsuds made it easy to programmatically connect to any F5 and accomplish any goal.

I created a set of bigsuds scripts and published them to buzzsurfr/f5-bigsuds-utils and DevCentral. They range from connecting to the active device in an HA pair to locating orphaned iRules (iRules not associated with a Virtual Server) to finding a Pool/Virtual Server based on a Node IP Address.

In 2013, F5 also released their first version of iControl REST, a REST-based API, and the f5-sdk, which offered a cleaner interface and object-oriented code for maintaining an F5 device. I converted some of my scripts to use the f5-sdk and again pushed them to buzzsurfr/f5-sdk-utils and DevCentral.

Programmable Logic

Hardware vendors have historically struggled to keep up with the pace of innovation in technology. One time, we were evaluating a core network refresh. Instead of discussing what the products could do, we spent more time discussing what they would do in the future. I recall a colleague asking all the major vendors when they would support TRILL (don’t judge 😃). Almost always, the answer required new hardware, and it would be no sooner than 18 months.

While I understand the need to put this type of logic directly into hardware, why not have a stopgap? Put a process in place to code the feature in software, then promote it to hardware at a later date. F5 was the first time I saw this business model, and I was immediately drawn to it. If the F5 didn’t have a feature I needed, then I just wrote the logic in an iRule, and the F5 processed that logic as part of handling traffic. Suddenly, I stopped asking my F5 representatives when a feature would be released and started asking how I could program that feature myself.

F5 devices come with preloaded iRules, but over time I had to create my own, which I collected in buzzsurfr/f5-iRules. A few examples:

  • One time, I had a customer with a broken app that would intermittently respond with multiple Content-Length headers (which breaks RFC 2616). They weren’t sure why, but it needed to be fixed. We fixed it with an iRule until they could find and resolve the bug in the application. This wasn’t a load balancing problem, but we still used the load balancer to work around the problem and remove customer pain.
  • I had a need to implement content switching, which wasn’t supported by F5 at the time. With iRules, I was able to implement content switching on both host and path until F5 supported it natively.

I don’t spend much time with F5 products these days, but I still use the programmable logic model. In my current role at AWS, I often find gaps in features that are needed by my customers, and many times we’re able to develop a Lambda function to fill the gap until the feature is released. I’ve watched this same model serve both F5 Networks and AWS well, and I hope the trend continues with other products as we continue to evolve.

My F5-based repositories:

  • buzzsurfr/f5-bigsuds-utils
  • buzzsurfr/f5-sdk-utils
  • buzzsurfr/f5-iRules

Docker Hugo

After restarting my blog, I wanted a way to automate my workflow. I currently work for AWS and want to use the features of the cloud to manage and deploy my blog, but for as little cost as possible. The lowest-cost option for a static site like mine is Amazon S3, which can host the objects in a bucket as a static website.

This starts by adopting a solid framework for building static sites. After trying a few, I selected Hugo. I had been using mkdocs for training/tutorials but felt it lacked a good native layout engine and wasn’t a good fit for a blog.

I followed the installation instructions, but wanted something I could containerize (since it’s relevant to my current work). Thus, I created docker-hugo as a simple project to containerize Hugo.

For now, this includes a README and a Dockerfile (copied as of August 3, 2018):

FROM centos:latest as builder

RUN yum -y update
RUN curl -sL -o hugo.tar.gz https://github.com/gohugoio/hugo/releases/download/v0.46/hugo_0.46_Linux-64bit.tar.gz && tar zxf hugo.tar.gz hugo

FROM scratch

COPY --from=builder /hugo .

VOLUME /host
WORKDIR /host
EXPOSE 1313

ENTRYPOINT ["/hugo"]

While a simple example, it does combine some newer Docker features. I used a multi-stage build to download the actual binary, then a scratch image for the actual deployment. The README highlights the syntax I use for the command, and an alias for being able to run hugo new posts/docker-hugo.md with all of my environment variables already plugged in. This can also be adapted for a future CI/CD process.

Add Athena Partition for ELB Access Logs

If you’ve worked on a load balancer, then at some point you’ve witnessed the load balancer taking the blame for an application problem (it’s like a rite of passage). The load balancer used to be difficult to exonerate, but with AWS Elastic Load Balancing you can capture Access Logs (Classic and Application only) and very quickly identify whether the load balancer contributed to the problem.

As with any log analysis, the volume of logs and the frequency of access are key to identifying the best log analysis solution. If you have a large store of logs but access them infrequently, then a low-cost option is Amazon Athena. Athena enables you to run SQL-based queries against your data in S3 without an ETL process. The data is durable, and you only pay for the volume of data scanned per query. AWS also includes documentation and templates for querying Classic Load Balancer logs and Application Load Balancer logs.

This is a great model, but with a potential flaw: as the data set grows in size, the queries become slower and more expensive. To remedy this, Amazon Athena allows you to partition your data, which restricts the amount of data scanned, lowering cost and increasing query speed.

ELB Access Logs are stored in S3 using the following format:

s3://bucket[/prefix]/AWSLogs/{{AccountId}}/elasticloadbalancing/{{region}}/{{yyyy}}/{{mm}}/{{dd}}/{{AccountId}}_elasticloadbalancing_{{region}}_{{load-balancer-name}}_{{end-time}}_{{ip-address}}_{{random-string}}.log

Since the prefix does not pre-define partitions, the partitions must be created manually. Instead of creating partitions ad hoc, create a CloudWatch Scheduled Event that runs daily and targets a Lambda function that adds the partition. To simplify the process, I created buzzsurfr/athena-add-partition.

This project contains both the Lambda function code and a CloudFormation template to deploy the Lambda function and the CloudWatch Scheduled Event. Logs are sent from the load balancer into an S3 bucket. Each day, the CloudWatch Scheduled Event invokes the Lambda function to add a partition to the Athena table.
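
The repository has the real function and template, but the core of the Lambda is essentially one ALTER TABLE statement issued through the Athena API. A rough sketch of that idea (the database, table, log location, and query-results location below are placeholders, not the project’s actual values):

from datetime import datetime, timezone

import boto3

athena = boto3.client("athena")

# Placeholders -- match these to your own Athena table and log bucket
DATABASE = "logs"
TABLE = "elb_logs"
LOG_LOCATION = "s3://bucket/prefix/AWSLogs/123456789012/elasticloadbalancing/us-east-1"
QUERY_RESULTS = "s3://bucket/athena-query-results/"


def handler(event, context):
    # The scheduled event fires daily, so add a partition for today's date
    today = datetime.now(timezone.utc)
    year, month, day = today.strftime("%Y"), today.strftime("%m"), today.strftime("%d")

    query = (
        f"ALTER TABLE {TABLE} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{day}') "
        f"LOCATION '{LOG_LOCATION}/{year}/{month}/{day}/'"
    )

    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": QUERY_RESULTS},
    )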

Using the partitions requires modifying the SQL query used in the Athena console. Consider the basic query to return all records: SELECT * FROM logs.elb_logs. Add or append a WHERE clause that includes the partition keys and their values. For example, to query only the records for July 31, 2018, run:

SELECT *
FROM logs.elb_logs
WHERE
  (
    year = '2018' AND
    month = '07' AND
    day = '31'
  )

With partitions enabled, this query restricts Athena to scanning only

s3://bucket/prefix/AWSLogs/{{AccountId}}/elasticloadbalancing/{{region}}/2018/07/31/

instead of

s3://bucket/prefix/AWSLogs/{{AccountId}}/elasticloadbalancing/{{region}}/

resulting in a significant reduction in cost and processing time.

Using partitions also makes it easier to enable other Storage Classes like Infrequent Access, where you pay less to store but pay more to access. Without partitions, every query would scan the bucket/prefix and potentially cost more due to the access cost for objects with Infrequent Access storage class.
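
As a rough sketch of that pairing (the bucket, prefix, and 30-day window below are placeholders chosen for illustration), a lifecycle rule can transition older log objects to Standard-IA while recent partitions stay in Standard:

import boto3

s3 = boto3.client("s3")

# Placeholder bucket/prefix: transition ELB access logs to Standard-IA after 30 days
# Note: this call replaces any existing lifecycle configuration on the bucket
s3.put_bucket_lifecycle_configuration(
    Bucket="bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "elb-logs-to-ia",
                "Filter": {"Prefix": "prefix/AWSLogs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            }
        ]
    },
)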

This model can be applied to other logs stored in S3 that do not have pre-defined partitions, such as CloudTrail logs, CloudFront logs, or logs from other applications that export to S3 but don’t allow modifications to the organizational structure.