Custom Kubernetes Installation Journey

If you are reading this post, I assume you either actively use Kubernetes, are exploring it, or just want to learn more about it. Either way, I am going to share my experience with Kubernetes: what I have learned along the way, and the journey of installing Kubernetes for a production use case in one of my previous projects at work.

When I first looked into Kubernetes, I learned about it by reading the documentation, digging into its internal components, doing some POCs, and playing with EKS, the managed Kubernetes service on AWS. After a couple of months, it was finally time to think about production: a way to automate the entire deployment of Kubernetes and to provision and bootstrap a cluster in a one-click fashion. One of our requirements was to be able to deploy the Kubernetes control plane either in single-node mode for smaller customers or in HA mode for larger customers. The other requirement was to be able to deploy Kubernetes on any public or private cloud provider. Because of the second requirement, we couldn't use any of the managed Kubernetes services such as EKS, and we had to deploy both the control plane and the data plane ourselves.
We tried different approaches, such as Ansible and Terraform, plain kubeadm, and a couple of other methods, each with its own pros and cons, until we finally found the approach that worked best for us based on our use case and requirements.

What we ended up doing was writing a custom Kubernetes installer. If you think about this problem, it actually has two parts. First, provisioning the infrastructure needed to run Kubernetes: creating the virtual machines, configuring their networking and firewall rules, and so on. Second, installing and bootstrapping Kubernetes itself: installing the control plane components, setting up the master node, and joining the worker nodes to the cluster. Our first goal was to solve both of these problems for AWS, since the majority of our customers were on AWS based on a survey we conducted among them.
We ended up using the following open source tools and AWS features:

  • AWS CDK
  • AWS SSM Agent
    • SSM Parameter Store
    • SSM Run Command
  • Kurl.sh

Let’s review what each of these tools is at a high level, and later I’ll explain how we ended up using them in our custom Kubernetes installer.

AWS CDK

Cloud Development Kit (CDK) is an open source framework for defining and provisioning your cloud resources, such as VPCs, EC2 instances, and Lambda functions, using the programming language of your choice (at the time of this writing: Python, TypeScript, JavaScript, Java, and C#/.NET).
CDK is roughly comparable to Terraform and CloudFormation. In fact, the CDK code you write gets synthesized into CloudFormation templates. What makes CDK unique is that you work in a familiar programming language with all of its familiar concepts, such as loops, conditions, and even OOP abstractions, which makes it easy to write reusable code.
CDK provides high-level components called constructs that preconfigure your cloud resources with default values based on AWS best practices, so you don't need to be an expert and the chance of misconfiguration goes down. Because of this, you end up writing far less code than with Terraform or CloudFormation and become much more efficient.
CDK constructs can be highly customized and reused. Just like any other software library, you can easily share your code across projects, which lets you bootstrap a new project with best practices already baked in, in no time.
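To make this concrete, here is a minimal sketch of a CDK stack in Python (CDK v2 style). The stack name, construct IDs, and the two-AZ choice are just illustrative, not our actual code:

```python
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct


class ClusterVpcStack(Stack):
    """A VPC whose subnets, route tables, and gateways come from construct defaults."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # The ec2.Vpc construct fills in subnets, routing, and NAT based on
        # AWS defaults; we only state the intent (a VPC across two AZs).
        self.vpc = ec2.Vpc(self, "ClusterVpc", max_azs=2)


app = App()
ClusterVpcStack(app, "cluster-vpc")
app.synth()  # produces the CloudFormation template behind the scenes
```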

AWS SSM Agent

AWS Systems Manager Agent (SSM Agent) is Amazon software that you install and configure on Linux, Windows, and macOS virtual machines (VMs) running on AWS or on-premises. Once the agent is installed and configured, you can update, manage, and configure those VMs. The machines also report status and execution results back to the AWS Systems Manager service, communicating over the Amazon Message Delivery Service (the ec2messages.* endpoints).
One of the use cases for the agent is sending commands to your EC2 instances to execute or start a task. AWS Systems Manager Run Command is the tool that lets you securely and remotely send commands to your managed instances (instances with the agent installed). It is comparable to Chef and Ansible (although Ansible is agentless), which many companies use to configure and modify the fleets of machines in their environments. Run Command also lets you group machines by tag and send a command to only a subset of the instances in your fleet.
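As an illustration, this is roughly what sending a command through Run Command looks like from Python with boto3; the instance ID and the command itself are placeholders:

```python
import boto3

ssm = boto3.client("ssm")

# Send a shell command to a managed instance (one with the SSM Agent installed).
response = ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],   # placeholder instance ID
    DocumentName="AWS-RunShellScript",     # built-in document that runs shell commands
    Parameters={"commands": ["uptime"]},
)
command_id = response["Command"]["CommandId"]

# Poll for the result of that command on the target instance.
invocation = ssm.get_command_invocation(
    CommandId=command_id,
    InstanceId="i-0123456789abcdef0",
)
print(invocation["Status"], invocation["StandardOutputContent"])

# Instead of instance IDs, Targets=[{"Key": "tag:Role", "Values": [...]}] can be
# used to address a tagged subset of the fleet, as mentioned above.
```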

AWS Systems Manager Parameter Store is another feature of SSM that provides a secure, remote way to manage your configuration data and secrets. Parameter Store's hierarchical storage model makes it easy to create namespaces for your application. You can store your data as key-value pairs, either in plain text or encrypted.
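A quick sketch of writing and reading a parameter with boto3; the hierarchical name below is only an example of the namespacing, not a required convention:

```python
import boto3

ssm = boto3.client("ssm")

# Store a value under a hierarchical key, encrypted at rest as a SecureString.
ssm.put_parameter(
    Name="/mycluster/prod/join-command",   # example namespace
    Value="kubeadm join ...",              # placeholder value
    Type="SecureString",
    Overwrite=True,
)

# Read it back and ask SSM to decrypt it.
param = ssm.get_parameter(Name="/mycluster/prod/join-command", WithDecryption=True)
print(param["Parameter"]["Value"])
print(param["Parameter"]["LastModifiedDate"])  # handy for freshness checks later on
```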

Kurl.sh

Kurl.sh is an open source custom Kubernetes distribution generator. It provides a user-friendly UI that you can use to customize your distribution. Once you have customized your cluster configuration for your use case, such as which version of Kubernetes to use and which add-ons (ingress, CNI plugin, metrics and monitoring, and so on) to include at which versions, it generates a custom URL that you can run online or download for an air-gapped installation.
Kurl.sh uses kubeadm under the hood. One of the difficulties of installing Kubernetes with kubeadm alone is the variety of tasks administrators need to perform before and after kubeadm join to make the installation production ready, such as installing kubeadm itself, installing a container runtime such as Docker, and configuring default cluster networking. Kurl.sh handles these steps for you.

Custom Installer

Now that we have a high level understanding of these tools, let’s see how we ended up using them to create our own Kubernetes installer.
As you can guess, we used CDK to provision our AWS resources. We had one CDK module that creates and configures a VPC, which is then consumed by two more modules, each of which provisions and configures EC2 instances: one module for the control plane and another for the data plane.
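Conceptually, the wiring between those modules looked something like the sketch below (simplified, with placeholder names and sizes); the VPC created in the first stack is handed to the stacks that create the instances:

```python
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct


class ClusterVpcStack(Stack):
    """The shared VPC (same idea as the earlier sketch)."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        self.vpc = ec2.Vpc(self, "ClusterVpc", max_azs=2)


class ControlPlaneStack(Stack):
    """EC2 instance(s) for the control plane, placed inside the shared VPC."""

    def __init__(self, scope: Construct, construct_id: str, vpc: ec2.IVpc, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        ec2.Instance(
            self, "Master",
            vpc=vpc,
            instance_type=ec2.InstanceType("t3.large"),           # placeholder size
            machine_image=ec2.MachineImage.latest_amazon_linux(),  # placeholder AMI choice
        )


app = App()
vpc_stack = ClusterVpcStack(app, "cluster-vpc")
ControlPlaneStack(app, "control-plane", vpc=vpc_stack.vpc)
# A DataPlaneStack for the worker nodes would consume the same VPC in the same way.
app.synth()
```

Passing the vpc object from one stack into the others is what lets CDK generate the cross-stack references for us.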
We use EC2 user data, which is a way to pass custom bootstrapping commands to EC2 instances to run at launch time. For the control plane, we run the Kurl.sh installer, which takes care of downloading and installing all the control plane components, such as etcd, kube-apiserver, and kube-scheduler, as Docker containers. Once the installation is done, Kurl.sh prints two very important messages, each containing a different command:

  1. Join Command
  2. Generate new Join Command

I will explain what these two commands do shortly, but before jumping into that, it's important to note that after a successful run of the Kurl.sh installer, the rest of the user data script parses these two commands and pushes them to the SSM Parameter Store, to be consumed by the worker nodes at join time.
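Put together, the control plane's user data looked conceptually like the sketch below. This is heavily simplified: the kurl.sh installer URL, the parameter names, and the two extract_* helpers that parse the installer output are placeholders for illustration, not our exact script.

```python
from aws_cdk import aws_ec2 as ec2

# User data for the control plane instance; these commands run once, at launch time.
user_data = ec2.UserData.for_linux()
user_data.add_commands(
    # 1. Run the generated kurl.sh installer and keep its output, which ends with
    #    the two commands we care about. (Placeholder installer URL.)
    "curl -sSL https://kurl.sh/OUR_INSTALLER_ID | sudo bash | tee /var/log/kurl-install.log",
    # 2. Pull the join command and the 'generate new join command' out of the log.
    #    extract_join_command / extract_regen_command are hypothetical helpers;
    #    the real parsing depends on the installer's output format.
    'JOIN_CMD="$(extract_join_command /var/log/kurl-install.log)"',
    'REGEN_CMD="$(extract_regen_command /var/log/kurl-install.log)"',
    # 3. Push both commands to the SSM Parameter Store for the workers to consume.
    'aws ssm put-parameter --name /mycluster/join-command --type SecureString --overwrite --value "$JOIN_CMD"',
    'aws ssm put-parameter --name /mycluster/regen-join-command --type SecureString --overwrite --value "$REGEN_CMD"',
)
# user_data is then attached to the control plane ec2.Instance via its user_data property.
```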
The first one, the join command, is what any worker node uses to join the cluster. For security purposes, this command carries an ephemeral token that is only valid for 24 hours, and a valid token is required for any node to join through the master node. This means that if you want to join a new node to your cluster after a week or so, the original command no longer works, which is why Kurl.sh provides the second command. As you may guess, the second command is run on the master node to generate a new join command with a fresh token.
When a worker node joins the cluster, right after the first boot of its operating system it executes the commands in its user data script. First, the worker fetches the join command that was pushed to the SSM Parameter Store after the control plane installation. By checking the timestamp, it decides whether it can use that join command directly, which is the case only if the timestamp is less than 24 hours old. If the command is older than that, the worker fetches the second command (generate a new join command) from the Parameter Store and executes it on the master node using SSM Run Command. The master node then generates a new join command and pushes it to the Parameter Store for the worker to use. After about 30 seconds or so, the joining worker fetches the new join command from the Parameter Store and uses it to join the cluster, since it is now fresh.
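Here is a condensed sketch of that worker-side logic in Python with boto3 (our real user data was a shell script); the parameter names, the master's tag, and the exact 24-hour window are illustrative:

```python
import subprocess
import time
from datetime import datetime, timedelta, timezone

import boto3

ssm = boto3.client("ssm")
JOIN_PARAM = "/mycluster/join-command"          # placeholder names, matching the
REGEN_PARAM = "/mycluster/regen-join-command"   # control plane sketch above


def fetch(name: str):
    """Return a parameter's value and the time it was last written."""
    p = ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]
    return p["Value"], p["LastModifiedDate"]


join_cmd, written_at = fetch(JOIN_PARAM)

# The bootstrap token inside the join command is only valid for 24 hours.
if datetime.now(timezone.utc) - written_at > timedelta(hours=24):
    # Stale: ask the master (targeted by tag) to run the regenerate command.
    # As stored, that command also pushes the fresh join command back to the
    # Parameter Store, per the flow described above.
    regen_cmd, _ = fetch(REGEN_PARAM)
    ssm.send_command(
        Targets=[{"Key": "tag:Role", "Values": ["k8s-master"]}],  # placeholder tag
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": [regen_cmd]},
    )
    time.sleep(30)  # give the master a moment to publish the fresh command
    join_cmd, _ = fetch(JOIN_PARAM)

# Join the cluster with a command that is guaranteed to be fresh enough.
subprocess.run(join_cmd, shell=True, check=True)
```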
Using this method, we can join any worker node to our cluster at any time without any human intervention. It also lets us scale the cluster dynamically, since a worker node that has just been spun up by Auto Scaling can join the cluster at any time.