Bootstrapping pi-bernetes: including the wheels

In a previous post, I shared my journey through creating a repeatable build of my homelab cluster using ansible. I can now rebuild Kubernetes anytime I need/want to, but what should I do with it?

Finding my problem while eating humble pie

One idea is to have a locally-hosted all-in-one git service like Gitea. In previous builds, I started installing gitea using a helm chart. I could then forward the port to my local workstation and I had git!

However, I’m not always at that workstation and need to access gitea without necessarily using kubectl, so I opted to create a Load Balancer. K3s does include ServiceLB, but it lacks features and didn’t work out of the box on my network. MetalLB has the support and community, so I went and grabbed that helm chart and installed it. Presto! Now I can support load balancers.

Then, I had to restart a pod–and lost my gitea installation. I didn’t enable persistent storage on my gitea deployment. Well to do that, I need to check the CSI drivers. There’s the default local-path, but that doesn’t allow my pods to move. Since Rancher makes both K3s and Longhorn, I fetched the longhorn helm chart and had persistent storage.

Then I needed to customize Traefik (installed by default) and broke it…

…and wanted to monitor everything, so put prometheus on, and broke it again…

…and there came a point where I questioned whether I was really experienced at Kubernetes at all! [1]

My problem wasn’t experience or knowledge based, but rather how I had chosen to operate. Every time I rebuilt the cluster, I would say to myself “I should probably automate this–I’ll do it after I build it”…and never go back to it.

I realized that most of my IT career had been spent watching customers and clients install a package to a linux server, or build a new S3 bucket in the AWS console, or apply a schema patch to a database…

…and I had just done the same thing!

My proposed solution had always been the same: just automate it. So I did.

Now, I can completely wipe k3s off the SBCs and, in one command, get it running again.

Attaching the wheels to the frame

With a Kubernetes cluster, I have a frame(work) that I can put widgets on. Like a car can’t go anywhere without wheels (still waiting for my flying car, thanks Back to the Future Part II), my Kubernetes cluster needs some support before I can use it for my true goals. I need MetalLB, a CSI, a customized traefik, etc.

One reason I picked ansible for building the cluster was that I could use it to both deploy the cluster AND the Kubernetes resources. I also considered OpenTofu (not Terraform–here’s why) and had a few other suggestions (which I haven’t really looked at yet). I may go that direction in the future, but borrowing the leadership principle Bias for Action, I picked one and can always change it later.

Bias for Action
Speed matters in business. Many decisions and actions are reversible and do not need extensive study. We value calculated risk taking.

-Amazon Leadership Principles

I started with a basic playbook template to make sure I could query Kubernetes by listing the namespaces in the cluster.

- name: Kubernetes Components
  hosts: kubernetes
  gather_facts: false
  tasks:
    - kubernetes.core.k8s_info:
        context: k3s-ansible
        kind: Namespace
      register: ns
    - ansible.builtin.debug:
        var: ns.resources | map(attribute='') | list

I have this host entry in my inventory.yaml file as well. This lets me specify kubernetes as the host above.

    ansible_connection: local
    ansible_python_interpreter: "{{ansible_playbook_python}}"

As a quick test, I get this output.

PLAY [Kubernetes Components] ******************************************************************************

TASK [kubernetes.core.k8s_info] ***************************************************************************
ok: [k8s-azeroth]

TASK [ansible.builtin.debug] ******************************************************************************
ok: [k8s-azeroth] => {
    "ns.resources | map(attribute='') | list": [

PLAY RECAP ************************************************************************************************
k8s-azeroth        : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

I now have an easy mechanism to call the kubernetes API from within my same ansible structure!

Adding the first component – MetalLB

After looking at the MetalLB installation guide, I saw that it also supports kustomize, so I tried to set up kustomize through ansible. The task is still kubernetes.core.k8s, but there’s a lookup module specifically for kustomize. The task looks like this:

    - name: Network - MetalLB
      kubernetes.core.k8s:
        state: present
        namespace: metallb-system  # see footnote 2
        definition: "{{ lookup('kubernetes.core.kustomize', dir='' ) }}"
      tags: network

It took some investigation, but the task above is the equivalent of this kubectl command:

kubectl create -n metallb-system -k

Each task supports tags, which I can use later to only install a certain type of component. In this case, I could limit the tasks to the network tag. While it’s not necessary now, it becomes useful very fast.

MetalLB also takes a little extra configuration, which is provided in the form of CustomResources. In my homelab, I have carved out a specific IP range for the load balancer, and I assign it to this cluster with this task:

    - name: Network - LoadBalancer IP addresses
      kubernetes.core.k8s:
        state: present
        src: ../manifests/metallb/ipaddresspool.yaml
      tags: network

For reference, ipaddresspool.yaml contains:

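apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default
  namespace: metallb-system
# spec.addresses (the IP range reserved for this cluster) is not shown
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system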

Alternatively, I can use the full power of the kubernetes.core.k8s module to rearrange and pull files or definitions as necessary. For example, I could split the two resources in that file into two ansible tasks, placing each resource definition verbatim under the definition: property.

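    - name: IPAddressPool
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: metallb.io/v1beta1
          kind: IPAddressPool
          metadata:
            name: default
            namespace: metallb-system
      tags: network
    - name: L2Advertisement
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: metallb.io/v1beta1
          kind: L2Advertisement
          metadata:
            name: default
            namespace: metallb-system
      tags: network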

This is the flexibility I was looking for, and I’m using the same tool for everything thus far!


I’m documenting my complete stack (“eventually”), but I can use this pattern to add different tasks and plays in the same way I’d manage helm charts, resource definitions, or kustomizations. I’d like to try the same setup with terraform or other tools (or to read someone else’s blog about it!), but first I have more components to install before I can put Gitea on my cluster!

  1. Imposter syndrome is real! After all these years, I still feel like an imposter, even if I’ve talked about a topic a hundred times before. You don’t have to know it all–but share what you do know and help someone else learn! ↩︎
  2. Ansible additionally creates the namespace if it does not exist, since it would be required for the state to succeed. ↩︎

Rotate IAM Access Keys

How often do you change your password?

Within AWS is a service called Trusted Advisor. Trusted Advisor runs checks in an AWS account looking for best practices around Cost Optimization, Fault Tolerance, Performance, and Security.

In the Security section, there’s a check (Business and Enterprise Support only) for the age of an Access Key attached to an IAM user. The check warns for any key older than 90 days and alerts for any key older than 2 years. AWS recommends rotating the access keys for each IAM user in the account.

From Trusted Advisor Best Practices (Checks):

Checks for active IAM access keys that have not been rotated in the last 90 days. When you rotate your access keys regularly, you reduce the chance that a compromised key could be used without your knowledge to access resources. For the purposes of this check, the last rotation date and time is when the access key was created or most recently activated. The access key number and date come from the access_key_1_last_rotated and access_key_2_last_rotated information in the most recent IAM credential report.

The reason for these times is the mean time to crack an access key. Using today’s standard processing unit, an AWS Access Key could take xxx to crack, and users should rotate their Access Key before that time.

Yet in my experience, this often goes unchecked. I’ve come across an Access Key that was 4.5 years old! I asked why not change it, and the answer is mostly the same–the AWS Administrators and Security teams do not own and manage the credential, and the user doesn’t want to change the credential for fear it will break their process.

Rotating an AWS Access Key is not difficult. It’s a few simple commands to the AWS CLI (which you presumably have installed if you have an Access Key).

  1. Create a new access key (CreateAccessKey API)
  2. Configure AWS CLI to use the new access key (aws configure)
  3. Disable the old access key (UpdateAccessKey API)
  4. Delete the old access key (DeleteAccessKey API)
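
The four steps above can be sketched in Python. This is a minimal illustration, not the script from buzzsurfr/aws-utils: the function name rotate_access_key and the make_client callback are hypothetical, and iam is assumed to be a boto3 IAM client (for example boto3.client("iam")). The old key is disabled and deleted through a client built from the new key, so a problem with the new key shows up before the old one is gone.

```python
def rotate_access_key(iam, make_client, user_name, old_key_id, delete=True):
    """Rotate one access key for user_name and return the new key.

    iam         -- IAM client authenticated with the old key (assumed boto3)
    make_client -- hypothetical callback: (key_id, secret) -> IAM client
    """
    # 1. Create the replacement key with the current (old) credentials.
    new_key = iam.create_access_key(UserName=user_name)["AccessKey"]

    # 2. Switch to the new credentials. A new key can take a few seconds
    #    to propagate, so real code may need to retry the calls below.
    new_iam = make_client(new_key["AccessKeyId"], new_key["SecretAccessKey"])

    # 3. Disable the old key using the new one, proving the new key works.
    new_iam.update_access_key(
        UserName=user_name, AccessKeyId=old_key_id, Status="Inactive"
    )

    # 4. Optionally delete the old key (mirrors --delete / --no-delete).
    if delete:
        new_iam.delete_access_key(UserName=user_name, AccessKeyId=old_key_id)

    return new_key
```

With boto3, make_client could be as small as lambda key_id, secret: boto3.client("iam", aws_access_key_id=key_id, aws_secret_access_key=secret). The caller still has to store the returned key, for example with aws configure.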

Instead of requiring each user to remember the correct API calls and the parameters for each, I’ve created a script in buzzsurfr/aws-utils that orchestrates the process. Written in Python (a dependency of AWS CLI, so again should be present), the script minimizes the number of parameters and removes the undifferentiated heavy lifting associated with selecting the correct key. It confirms that the user’s access is stable by using the new access key to remove the old access key. The script can be scheduled using cron or Scheduled Tasks and supports CLI profiles.

usage: [-h] --user-name USER_NAME
                            [--access-key-id ACCESS_KEY_ID]
                            [--profile PROFILE] [--delete] [--no-delete]

optional arguments:
  -h, --help            show this help message and exit
  --user-name USER_NAME
                        UserName of the AWS user
  --access-key-id ACCESS_KEY_ID
                        Specific Access Key to replace
  --profile PROFILE     Local profile
  --delete              Delete old access key after inactivating (Default)
  --no-delete           Do not delete old access key after inactivating
  --verbose             Verbose

To use the script, the user must have the right set of permissions for their IAM user. This template is an example and only grants the IAM user permissions to change their own access key.

From IAM: Allows IAM Users to Rotate Their Own Credentials Programmatically and in the Console:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": "*"
            "Effect": "Allow",
            "Action": [
            "Resource": ["arn:aws:iam::*:user/${aws:username}"]

This script is designed for users to rotate their own credentials. It does not apply to “service accounts” (where the credential is configured on a server or unattended machine). If the machine is an EC2 Instance or ECS Task, then attaching an IAM Role to the instance or task will automatically handle rotating the credential. If the machine is on-premises or hosted elsewhere, then adapt the script to work unattended (I’ve thought about coding it as well).

As an AWS Administrator, you can’t simply pass out the script and expect all users to rotate their access keys on time. Remember to build the system around it. Periodically query the TA check looking for access keys older than 90 days (warned), and send that user a reminder to rotate their access key. Take it a step further by automatically disabling access keys older than 120 days (warn them in the reminder). Help create good security posture and a good experience for your users, and make your account more secure at the same time!
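
As a rough sketch of that policy, the decision for each key can be reduced to its age. The names key_action and plan are hypothetical, the 90 and 120 day thresholds are the ones suggested above, and how the key creation dates are collected (Trusted Advisor, the credential report, or ListAccessKeys) is left out.

```python
from datetime import datetime, timedelta, timezone

REMIND_AFTER = timedelta(days=90)    # matches the Trusted Advisor warning
DISABLE_AFTER = timedelta(days=120)  # the stricter cutoff suggested above


def key_action(created, now=None):
    """Return 'ok', 'remind', or 'disable' for a key created at `created`."""
    now = now or datetime.now(timezone.utc)
    age = now - created
    if age > DISABLE_AFTER:
        return "disable"
    if age > REMIND_AFTER:
        return "remind"
    return "ok"


def plan(keys, now=None):
    """Map (user_name, key_id) -> action for keys that need attention.

    `keys` is an iterable of (user_name, key_id, created) tuples,
    however they were gathered.
    """
    actions = {}
    for user_name, key_id, created in keys:
        action = key_action(created, now)
        if action != "ok":
            actions[(user_name, key_id)] = action
    return actions
```

The output of plan can then drive the reminder email and the call that sets a key to Inactive.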

Add Athena Partition for ELB Access Logs

If you’ve worked on a load balancer, then at some point you’ve been witness to the load balancer taking the blame for an application problem (like a rite of passage). This used to be difficult to exonerate, but with AWS Elastic Load Balancing you can capture Access Logs (Classic and Application only) and very quickly identify whether the load balancer contributed to the problem.

Much like any log analysis, the volume of logs and frequency of access are key to identifying the best log analysis solution. If you have a large store of logs but infrequently access them, then a low-cost option is Amazon Athena. Athena enables you to run SQL-based queries against your data in S3 without an ETL process. The data is durable and you only pay for the volume of data scanned per query. AWS also includes documentation and templates for querying Classic Load Balancer logs and Application Load Balancer logs.

This is a great model, but with a potential flaw–as the data set grows in size, the queries become slower and more expensive. To remediate, Amazon Athena allows you to partition your data. This restricts the amount of data scanned, thus lowering costs and increasing speed of the query.

ELB Access Logs store the logs in S3 using the following format:


Since the prefix does not pre-define partitions, the partitions must be created manually. Instead of creating partitions ad-hoc, create a CloudWatch Scheduled Event that runs daily targeted at a Lambda function that adds the partition. To simplify the process, I created buzzsurfr/athena-add-partition.

This project is both the Lambda function code and a CloudFormation template to deploy the Lambda function and the CloudWatch Scheduled Event. Logs are sent from the Load Balancer into an S3 bucket. Daily, the CloudWatch Scheduled Event will invoke the Lambda function to add a partition to the Athena table.
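
To give a feel for what such a function has to produce, here is a minimal sketch that builds the ALTER TABLE statement for one day. It is not the code from buzzsurfr/athena-add-partition: the function name, the parameters, and the example values are hypothetical, the partition keys (year, month, day) are taken from the query later in this post, and the S3 layout is assumed to follow the documented ELB convention of AWSLogs/account-id/elasticloadbalancing/region/yyyy/mm/dd/.

```python
from datetime import date


def add_partition_statement(table, bucket, prefix, account_id, region, day):
    """Build the Athena DDL that registers one day of ELB logs.

    Assumed S3 layout:
    <prefix>/AWSLogs/<account>/elasticloadbalancing/<region>/yyyy/mm/dd/
    """
    year, month, dom = f"{day:%Y}", f"{day:%m}", f"{day:%d}"
    parts = [prefix.strip("/"), "AWSLogs", account_id,
             "elasticloadbalancing", region, year, month, dom]
    # Drop an empty prefix so the path has no double slash.
    location = f"s3://{bucket}/" + "/".join(p for p in parts if p) + "/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{dom}') "
        f"LOCATION '{location}'"
    )


# Example with made-up bucket, account, and region values.
statement = add_partition_statement(
    "elb_logs", "my-logs", "", "123456789012", "us-east-1", date(2018, 7, 31)
)
```

A Lambda handler would submit the resulting string to Athena (for instance with the StartQueryExecution API), passing the database name in the query execution context. IF NOT EXISTS keeps the daily run safe to repeat.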

Using the partitions requires modifying the SQL query used in the Athena console. Consider the basic query to return all records: SELECT * FROM logs.elb_logs. Add/append to a WHERE clause including the partition keys with values. For example, to query only the records for July 31, 2018, run:

SELECT *
FROM logs.elb_logs
WHERE
    year = '2018' AND
    month = '07' AND
    day = '31'

This query with partitions enabled restricts Athena to only scanning


instead of


resulting in a significant reduction in cost and processing time.
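
If queries are generated by a script rather than typed into the console, the date filter can be built from a date object so the zero padding always matches the partition values. This is a small sketch; partition_filter is a hypothetical helper, and the table name comes from the example above.

```python
from datetime import date


def partition_filter(day):
    """Return the WHERE predicate selecting one day's partition."""
    return f"year = '{day:%Y}' AND month = '{day:%m}' AND day = '{day:%d}'"


query = "SELECT * FROM logs.elb_logs WHERE " + partition_filter(date(2018, 7, 31))
```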

Using partitions also makes it easier to enable other Storage Classes like Infrequent Access, where you pay less to store but pay more to access. Without partitions, every query would scan the bucket/prefix and potentially cost more due to the access cost for objects with Infrequent Access storage class.

This model can be applied to other logs stored in S3 that do not have pre-defined partitions, such as CloudTrail logs, CloudFront logs, or logs from other applications that export to S3 but don’t allow modifications to the organizational structure.