
Configure Github Actions Runner Controller on EKS #59


Open
wants to merge 33 commits into base: main

Conversation

hussein-awala
Member

@hussein-awala hussein-awala commented Jul 20, 2023

This PR:

  • Creates a new EKS cluster in the Airflow AWS account using Terraform to run the GitHub Actions jobs:
    • A new VPC
    • 4 managed node groups: default for infra and ARC resources, plus small, medium, and large, where the runner pods are scheduled
    • An IAM role with an IRSA policy used by the Autoscaler
  • Installs and configures the K8S resources and ARC using Helm:
    • Create namespaces
    • Install a Helm release for the Autoscaler
    • Install a Helm release for cert-manager (used by ARC)
    • Install a Helm release for ARC
    • Create multiple deployments for runners with horizontal autoscalers, each one with a different config and labels
  • Adds the Dockerfile and scripts to install and use the Airflow Actions Runner.
  • Adds the documentation of the different resources and explains the steps to deploy a new change (will be followed by a PR which adds a CI workflow to plan and deploy the CI infrastructure)
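For context, here is a minimal Terraform sketch of what the default node group plus a tainted runner node group could look like with the terraform-aws-modules/eks module; the module version, group names, sizes, labels, and taint keys are illustrative assumptions, not the exact values in this PR:

# Sketch only - module version, sizes, labels and taint keys are assumptions.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "airflow-ci"
  cluster_version = "1.27"

  vpc_id     = module.vpc.vpc_id          # the new VPC created by this stack
  subnet_ids = module.vpc.public_subnets

  eks_managed_node_groups = {
    default_nodes = {
      # Infra pods: ARC, cert-manager, the cluster autoscaler, etc.
      name           = "default"
      instance_types = ["t4g.medium"]
      min_size       = 1
      max_size       = 2
      desired_size   = 1
    }
    small_runners = {
      # Dedicated to runner pods; the taint keeps infra pods off these nodes.
      name           = "small-runners"
      instance_types = ["t4g.medium"]
      min_size       = 0
      max_size       = 30
      labels         = { "runner-size" = "small" }
      taints = {
        dedicated = {
          key    = "dedicated"
          value  = "runners"
          effect = "NO_SCHEDULE"
        }
      }
    }
  }
}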

@hussein-awala hussein-awala changed the title [WIP] Configure Github Actions Runner Controller on EKS Configure Github Actions Runner Controller on EKS Jul 23, 2023
@hussein-awala hussein-awala marked this pull request as ready for review July 23, 2023 17:57
@ashb
Member

ashb commented Jul 25, 2023

Other things to note: Our current runners all have a docker user login (airflowcirunners) to avoid hitting any pull limits.

What is the "baseline" cost involved in running the EKS cluster and the ARC when no jobs are running?

default_nodes = {
  name = "default"

  instance_types = ["t4g.medium"]
Member

Our current runners are running on any of the following instance types -- whichever is the cheapest spot price

r6a.2xlarge
r5.2xlarge
r5ad.2xlarge
r5n.2xlarge
r5b.2xlarge
r5a.2xlarge
r6i.2xlarge

And we mount things as tmpfs using ~50% of the RAM (I thought):

  - [tmpfs, /var/lib/docker, tmpfs, "defaults,noatime,size=85%"]
  - [tmpfs, /home/runner/actions-runner/_work, tmpfs, "defaults,noatime"]

Member Author

I can use this list of instances with a different size for each runner deployment (replacing 2xlarge with medium and large for the small and medium runners). WDYT?

Member

In hindsight, I think we should try to run the tests without applying the tmpfs mounts.

I believe the speed-up and stability gain for running tests is not as big as I originally thought. I think having everything on tmpfs has a great impact when you produce a lot of small files (i.e. when you compile a lot of code). But when you just run stuff, or run docker where the files are mostly written once and there are not many thousands of them being created and updated, having a tmpfs does not change much - actually it even increases memory usage, because the files on the filesystem are cached by the POSIX FS layer anyhow. So maybe that tmpfs is not really needed to get a lot of the speedup coming from multiple processors running tests in parallel using the same prepared images.

After we get a "plain" solution working - we could then apply tmpfs and see the difference and compare.

@hussein-awala
Member Author

Other things to note: Our current runners all have a docker user login (airflowcirunners) to avoid hitting any pull limits.

Yes, I was planning to add it before merging. For now I'm using an ECR to store the runner images to avoid pulling them from Docker Hub, but we have other images in the workflow which we pull from Docker Hub. I already reached my 6-hour limit when I ran the tests multiple times.

What is the "baseline" cost involved in running the EKS cluster and the ARC when no jobs are running?

The ARC and the other infra pods are running on the default ng; I use t4g.medium for its instances. For the test phase, I had one replica of ARC and another one for the Autoscaler (these 2 services are the most important), but for production we need at least 2 replicas running on 2 different nodes.
The cost of t4g.medium is $0.03 per hour for on-demand instances, and if we reserve them it will be $0.02 ($262 vs $175 yearly per node); we also have a fixed cost for EKS of $72 monthly.
So if we use 2 on-demand nodes of type t4g.medium, the monthly cost will be ~$115 (up to $120 with the storage). Currently I'm using spot instances for this ng too, so the cost is $14 monthly instead of $21.6, but IMO it would be better to use normal nodes for availability.

However, the support of multiple types of runners (to run small CI jobs on a medium node instead of a 2xlarge), and the scalability we get from K8S, will help us reduce the total cost.
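As a quick sanity check of those figures (assuming ~730 hours in a month): 2 nodes × $0.03/h × 730 h ≈ $44, plus the fixed $72/month EKS control-plane fee, gives roughly $116/month - consistent with the ~$115 estimate above, before storage.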

@hussein-awala
Member Author

@ashb I made some changes in this PR:

  • I recreated the whole stack to delete the NAT gateway and move the EKS cluster nodes to a public subnet
  • I removed all the files used in the Dockerfile, and now I'm extending the official docker image (Ubuntu 22.04 instead of 20.04)

But there are a few more things to discuss:

Our current runners are running on any of the following instance types -- whichever is the cheapest spot price
r6a.2xlarge
r5.2xlarge
r5ad.2xlarge
r5n.2xlarge
r5b.2xlarge
r5a.2xlarge
r6i.2xlarge

All these instances are AMD; one of the goals was migrating completely to ARM (Graviton). Based on your list, we can use r6g, WDYT? I created a new var to easily define the instance types:

variable "runners_node_types" {
    description = "Node type for the runners"
    type        = list(string)
    default     = ["t4g"]
}

However, I wonder if using different types with different CPU/RAM ratios is a good idea, because when we update a job and it starts to reach the memory limit, we will not be able to detect the problem easily, since sometimes it will run on a node with more RAM. Personally I don't like mixing node types, but if there is a benefit for the price and the availability of spot instances, I'm ok with that.

And we mount things as tmpfs using ~50% of the RAM (I thought):

  • [tmpfs, /var/lib/docker, tmpfs, "defaults,noatime,size=85%"]
  • [tmpfs, /home/runner/actions-runner/_work, tmpfs, "defaults,noatime"]

Not ready yet, I will check the current runners and try to do the same thing
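For the K8S side, the closest equivalent to those tmpfs mounts would be memory-backed emptyDir volumes in the runner pod spec. A sketch with illustrative size limits (not something this PR configures yet):

# Sketch: memory-backed volumes for the runner pod (sizes are illustrative)
spec:
  containers:
    - name: runner
      volumeMounts:
        - name: docker-data
          mountPath: /var/lib/docker
        - name: work
          mountPath: /home/runner/actions-runner/_work
  volumes:
    - name: docker-data
      emptyDir:
        medium: Memory      # tmpfs backed by node RAM
        sizeLimit: 16Gi
    - name: work
      emptyDir:
        medium: Memory
        sizeLimit: 8Gi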

The runner considers version v2.304.0-airflow8 deprecated and updates itself by default:

√ Connected to GitHub

Current runner version: '2.304.0-airflow8'
2023-08-13 15:07:22Z: Listening for Jobs
Runner update in progress, do not shutdown runner.
Downloading 2.307.1 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should be back online within 10 seconds.
Runner update process finished.
Runner listener exit because of updating, re-launch runner after successful update
Update finished successfully.
Restarting runner...

√ Connected to GitHub

Current runner version: '2.307.1'
2023-08-13 15:07:35Z: Listening for Jobs                                                                                                                                                                       
2023-08-13 15:07:36Z: Running job: Build info

And when I added a config option to disable auto-update, the container entered a crash loop:

√ Connected to GitHub

Current runner version: '2.304.0-airflow8'                                                                                                                                                                      
2023-08-13 14:15:41Z: Listening for Jobs                                                                                                                                                                        
An error occurred: Runner version v2.304.0-airflow8 is deprecated and cannot receive messages.                                                                                                                  
Runner listener exit with retryable error, re-launch runner in 5 seconds.                                                                                                                                       
Restarting runner... 

Is there a reason not to upgrade the Airflow runner to 2.307.1, or is it just a lack of free time? If it's a matter of time, I can try it.

@potiuk
Member

potiuk commented Aug 30, 2024

Cc: @shahar1 @amoghrajesh @ashb @kaxil @jedcunningham @bugraoz93

Comment on lines +19 to +49
runnerScaleSets:
  arc-small-amd:
    minRunners: 0
    maxRunners: 30
    size: small
    arch: x64
  arc-medium-amd:
    minRunners: 0
    maxRunners: 30
    size: medium
    arch: x64
  arc-large-amd:
    minRunners: 0
    maxRunners: 30
    size: large
    arch: x64
  arc-small-arm:
    minRunners: 0
    maxRunners: 30
    size: small
    arch: arm64
  arc-medium-arm:
    minRunners: 0
    maxRunners: 30
    size: medium
    arch: arm64
  arc-large-arm:
    minRunners: 0
    maxRunners: 30
    size: large
    arch: arm64
Member

This totals up to a possibly large number of runners -- I worry about the cost given we don't have infinite credits.

Member

I think this is something we will need to plan separately. Personally, I think we only need small and big runners: small runners (even one CPU) for very small and fast jobs where we have no parallelisation, and big runners (even 16 CPU and a lot of RAM) where we have parallelisation (various tests, maybe). So I think something like 30 big and 30 small should be quite OK IMHO to start with. Besides, for now the small ones should not even be needed - because we can mix-and-match public and self-hosted runners in a single workflow - so we could use public runners for all the "small" jobs and self-hosted runners for all the "big" jobs.

But having flexibility in the future is useful.

Member

And we will need far fewer ARM runners - they are currently disabled, but unless we want to enable tests for ARM (which we could now), we need them only for image building.

Member Author

This totals up to a possibly large number of runners -- I worry about the cost given we don't have infinite credits.

Running 30 nodes for 3h costs the same as running 90 nodes for 1h. I'm not saying we have to scale to 1k, but we shouldn't worry about the max number of nodes/pods; we just need to correctly configure the scaling of ARC and the nodes.

And we will need far fewer ARM runners - they are currently disabled, but unless we want to enable tests for ARM (which we could now), we need them only for image building.

In the first phase we will only use AMD nodes/runners, but I plan to switch all the CI jobs to ARM runners (except the AMD docker image build), since ARM nodes are faster and cheaper.

For example, for a node with 32 GB of RAM and 8 vCPUs:

  • m7a.2xlarge --> $0.1698/h
  • m7g.2xlarge --> $0.1202/h (30% cheaper)

Member

@potiuk potiuk Sep 3, 2024

Interesting. I think eventually we should run tests on both ARM and AMD nodes - this was actually a prerequisite for me to make ARM part of the "officially" supported images. Currently we explain that ARM images should only be used for development, but I always wanted to run the complete set of tests on both AMD and ARM to be able to say "yeah - both platforms are equally well tested".

Running 30 nodes for 3h costs the same as running 90 nodes for 1h. I'm not saying we have to scale to 1k, but we shouldn't worry about the max number of nodes/pods; we just need to correctly configure the scaling of ARC and the nodes.

If we consider cost alone - yes. But there are other factors in favour of using a smaller number of bigger machines instead of a bigger number of smaller ones. On the number of workers - indeed, when we run our own nodes it does not matter much whether we have 8x 1 CPU or 1x 8 CPU; cost-wise it does not matter for raw "test execution".

And I agree it would be really nice if we could get rid of all the parallelism introduced by my scripts to run tests and other things in parallel on a single instance. In that case, for example, logging output is much more readable (Breeze has a whole suite of utils that handles multiplexing of the output from parallel processes and "folding" the output - but due to the size of the output, the GitHub interface is sometimes very sluggish and not convenient for looking at the output).

However, there are certain benefits of running "big" instances in our case:

  1. For public runners - INFRA sets limits on how many "jobs" each workflow should run in parallel (because it has a limited number of public runners donated by GitHub that are shared between all projects). This does not apply to our self-hosted runners of course, but in the case where we are using public runners with 2 CPUs, running 2 tests in parallel will speed up the tests without increasing the number of FT (full-time) runners we use. If you look at https://infra-reports.apache.org/#ghactions - Airflow is at 10 FT runners now (which is still well below the limit of 25 FT) - mostly because we have parallelism implemented.

  2. Having tests run in parallel on big machines can decrease the overall cost of the "cluster" a lot. When running tests and other docker-bound tasks (such as documentation building), it takes some time to "warm up" the machine - i.e. install Breeze and all necessary dependencies, clean up the machine, and download all the images to be used (the Breeze CI image and the DB image needed during tests - Postgres/MySQL) - this takes ~2-3 minutes in total but, more importantly, it takes a lot of space.

Some of that is parallelised (like downloading an image), so it will run faster on bigger instances, which means the elapsed time of execution "per test type" will be smaller when we run more parallel tests. But more important is the overhead and space/memory used for the "common" image that is reused between all the parallel instances. Currently the tests are grouped on an instance per "python version" and "database" - which means that the CI image and "database" image are downloaded once and used by all tests that run in parallel on that big instance.

So when each "test type" takes (say) 8 minutes - if we run it on a 16 CPU machine, we have to spend the 2-3 minutes of initialization only once - all 16 parallel tests will run on the pre-warmed machine using that single initialized "warmed" docker instance. This is about 4 GB of disk space and cache in memory per "running instance". If - instead - we run 16 x 1 CPU machines, then we will need 4 GB x 16 of disk space and memory to cache it. That means that effectively our cluster will be able to handle much more load, because each of the many small VMs would have to use more disk space/memory. Similarly, they share the same kernel - so that memory is used only once - and overall, having a big machine running parallel tests on multiple docker instances using the same images should be more optimised.

But we can test this once we have ARC in place as well, because there might be other factors (like imperfect parallelism, which has its limits) that might change the calculation - there is https://en.wikipedia.org/wiki/Gustafson%27s_law - which explains that parallelism always has a limit, because not everything can be parallelised and some things are always serialized - so there is a theoretical limit on the speed-up (and cost optimisation) we can get by parallelising more.

eks_managed_node_groups = {

  default_nodes = {
    name = "default"
Member

What are these nodes used for?

Member Author

For ARC, the autoscaler, the proxy, and all K8S service pods...

For the runner nodes, I added a taint as protection, so no pod can be scheduled on them without an explicit toleration.
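For illustration, this is roughly how the taint/toleration pairing works in K8S; the key, value, and label names below are assumptions, not necessarily the exact ones used in this PR:

# Hypothetical taint on the runner node group (key/value are assumptions):
#   kubectl taint nodes <runner-node> dedicated=runners:NoSchedule
# Runner pods then opt in with a matching toleration and node selector:
apiVersion: v1
kind: Pod
metadata:
  name: example-runner
spec:
  nodeSelector:
    runner-size: small                          # illustrative node label
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "runners"
      effect: "NoSchedule"
  containers:
    - name: runner
      image: example/airflow-ci-runner:latest   # placeholder image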

Comment on lines +45 to +68
variable "small_runners_node_size" {
description = "Node size for the small runners"
type = string
default = "medium"
}

variable "small_runners_x64_node_size" {
# This is a separate variable because the r series doesn't have a medium size
description = "Node size for the small runners"
type = string
default = "large"
}

variable "medium_runners_node_size" {
description = "Node size for the medium runners"
type = string
default = "xlarge"
}

variable "large_runners_node_size" {
description = "Node size for the large runners"
type = string
default = "2xlarge"
}
Member

@ashb ashb Aug 30, 2024

Nit: I'm not actually sure having these as variables makes it clearer -- to me (and this is just a nit) the variable use makes it harder to understand:

      instance_types = [for node_type in var.runners_node_types: "${node_type}.${var.large_runners_node_size}"]

vs

      instance_types = [for node_type in var.runners_node_types: "${node_type}.large"]

Member Author

My goal was to avoid touching the code to change these configurations, as we may need to adjust them according to the performance of our CI jobs. I will try to simplify them.
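One possible simplification - a sketch only, with hypothetical variable and local names rather than what this PR ends up using - would be to collapse the per-size variables into a single map:

# Hypothetical alternative: one map instead of one variable per runner size.
variable "runner_node_sizes" {
    description = "Instance size suffix per runner class"
    type        = map(string)
    default = {
        small  = "medium"
        medium = "xlarge"
        large  = "2xlarge"
    }
}

locals {
    # e.g. ["t4g.2xlarge", "r6g.2xlarge"] for the large runners
    large_runner_instance_types = [
        for node_type in var.runners_node_types : "${node_type}.${var.runner_node_sizes["large"]}"
    ]
}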

Airflow_CI.png Outdated
Member

The "need to scale if possible" is handled by the Kube cluster autoscaller right, ARC just creates pods and leaves the rest up to K8s?

Member Author

There are two levels of scaling: pod scaling for the runner set deployment, and node scaling to schedule the pods.

The two happen sequentially: ARC detects a job and asks the runner deployment to run it; if there is no available runner and the number of runners in the deployment is smaller than the configured max, it creates a new runner pod. The rest is handled by the K8S scheduler and autoscaler.
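For reference, in the legacy (summerwind) ARC mode that the original diagram described, that pod-level scaling is expressed with a RunnerDeployment plus a HorizontalRunnerAutoscaler. A minimal sketch with illustrative names and thresholds (not the exact manifests from this PR):

# Sketch of the legacy ARC objects; names, labels and thresholds are illustrative
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: airflow-small-runners
spec:
  template:
    spec:
      organization: apache                    # illustrative
      labels: ["self-hosted", "small"]
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: airflow-small-runners-hra
spec:
  scaleTargetRef:
    kind: RunnerDeployment
    name: airflow-small-runners
  minReplicas: 1
  maxReplicas: 30
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"
      scaleDownThreshold: "0.25"
      scaleUpFactor: "2"
      scaleDownFactor: "0.5"

Node scaling then stays with the cluster autoscaler: when a new runner pod cannot be scheduled, it adds a node to the matching node group and removes it again once the node is idle.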

Member

Yeah. That's the same situation as with our KEDA autoscaling - while KEDA can increase the number of celery workers, it's the Kubernetes Autoscaler that must set up or tear down nodes if there are not enough / too many resources to run the number of workers we need.

@hussein-awala
Member Author

The diagram in the PR is outdated; I switched to the Autoscaling Runner Scale Sets mode, as recommended by GitHub.

For more info:

And here is the commit for the switch: 1b82f07.

If you are not familiar with helmfile and Helm value layering via gotmpl files, you might find it a bit complicated to understand how the scale sets are created, but it simplifies creating and configuring runner scale sets and reduces redundancy.

The PR is not ready yet; the runner docker image needs some extra packages before it's ready to run all our CI jobs.
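For anyone not familiar with that layout, here is a rough helmfile sketch of the value layering; the chart repository is the official ARC gha-runner-scale-set chart, while the release names, namespace, and file paths are assumptions for illustration:

# helmfile.yaml (sketch - everything except the chart is illustrative)
repositories:
  - name: arc-charts
    url: ghcr.io/actions/actions-runner-controller-charts
    oci: true

releases:
  - name: arc-small-amd
    namespace: arc-runners
    chart: arc-charts/gha-runner-scale-set
    values:
      - values/common.yaml.gotmpl      # shared settings rendered per release
      - minRunners: 0
        maxRunners: 30
  - name: arc-large-arm
    namespace: arc-runners
    chart: arc-charts/gha-runner-scale-set
    values:
      - values/common.yaml.gotmpl
      - minRunners: 0
        maxRunners: 30

Each release layers the shared gotmpl values file with a small inline override, which is what keeps the per-scale-set duplication low.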

Terraform: Airflow infrastructure as code
========================================

In this folder, you will find the Terraform code to deploy the Airflow infrastructure on AWS.

Future note: I could help with automating Terraform deployment via GitHub Actions


RUN apt-get update \
&& apt-get install -y --no-install-recommends \
ca-certificates curl nodejs npm wget unzip vim git jq build-essential netcat \


The changes look good to me!
I skimmed through all the CI jobs and their scripts to make your job easier. The CI jobs already handle installing their own dependencies well; apart from some commands like curl, git, netcat, etc., this addition should be enough. I am not sure we need wget though - please correct me if I am wrong, but I haven't seen any usage of wget other than in a couple of Dockerfiles, which should install it themselves. Maybe we shouldn't install it here, to avoid dealing with its vulnerabilities.
Of course, it's always good to double-check. I hope it helps!

Suggested change
ca-certificates curl nodejs npm wget unzip vim git jq build-essential netcat \
ca-certificates curl nodejs npm wget unzip vim git jq build-essential netcat openssh-client \

Member

We will likely need more things. See the changes I made yesterday, for example, to make "public" runners work for building our images (I brought back the workflow that allows releasing images to Docker Hub using emulation):

https://github.com/apache/airflow/pull/41962/files

We need buildx and regctl for that to work - also, when emulation is used, we need to bring in qemu and the other tools needed for docker emulation.

I temporarily brought the workflow back for @utkarsharma2's and @kaxil's convenience yesterday - so that they do not have to run it manually - with emulation it takes about an hour to build the ARM layer for the production image (single Python version). But we will still need regctl and buildx installed, for example (and we will likely find a few other things missing on our runners).

BTW. For the multi-platform images and hardware support (I am not sure if you remember our conversation @hussein-awala some time ago), we will likely need to apply a slightly more complex scheme if we want to use ARC ARM runners - because (unlike when we started ARM instances from within the Amazon VM) we need to coordinate two running instances to build the layers separately and create a manifest. This seems to be easier now and there are ready-to-use recipes and actions - I captured it in apache/airflow#41935 (comment)
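For context, the split-build-then-merge flow referenced there typically looks like this; the image names are placeholders, and only the docker/buildx commands themselves are standard:

# On the AMD64 runner: build and push the amd64 image (placeholder names)
docker buildx build --platform linux/amd64 -t example/airflow-ci:tag-amd64 --push .

# On the ARM64 runner: build and push the arm64 image
docker buildx build --platform linux/arm64 -t example/airflow-ci:tag-arm64 --push .

# Anywhere afterwards: stitch the per-arch images into one multi-arch manifest
docker buildx imagetools create \
  -t example/airflow-ci:tag \
  example/airflow-ci:tag-amd64 \
  example/airflow-ci:tag-arm64

# Fallback on a single machine: emulate the other arch via qemu/binfmt
docker run --privileged --rm tonistiigi/binfmt --install all
docker buildx build --platform linux/amd64,linux/arm64 -t example/airflow-ci:tag --push .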


Thanks for pointing this out in detail! Agreed, we need to include these and maybe more. apache/airflow#41935 (comment) is pretty clear. Are we planning to remove emulation after we start using ARC ARM runners? Just to understand the whole scope of apache/airflow#41935.

Member

@potiuk potiuk Sep 8, 2024

Short answer: I think emulation as an option should remain.

Long answer:

Here (and this is already part of the "CI/CD" knowledge transfer :) ) - let me explain why I think it should.

I prefer - if possible - to have a manual backup for all the "automated" processes we have in CI. Generally, my "guiding principle" for any kind of CI work like this is to NEVER 100% rely on a particular CI doing the job exclusively. I simply hate to be put in a situation where "something else" fails and the only answer is "we have to wait for them to fix it". This is ok for temporary disruptions, but when it comes to processes like releasing, which we often do under time pressure and when users expect the new release to come up quickly, I prefer not to have to rely on 3rd parties as much as possible.

We actually saw that in the last few weeks, when the CI workflow to release RC packages was broken - we were only able to release Airflow, without having to find a "proper" solution immediately, because we had a manual process where you could either use hardware (if you happen to have two machines) or emulation (which @kaxil used) - even if it means it takes hours instead of minutes. We had a plan B, and a plan C, which involved not only someone (like me) who had the right hardware setup, but also someone who has just a local machine, good networking, and can "fire-and-forget" a process that runs for an hour rather than 10 minutes, without any special environment configuration.

BTW. In this case - even if I could help with having the hardware setup, my AMD linux workstation at home ACTUALLY BROKE last week and I got it back only on Friday :D....

So ALWAYS HAVE PLAN B (and C and D sometimes). .... That allows the CI team to sleep better :)

In our case we have manual processes for all the things that CI jobs currently do automatically, and we do not have a single part of the process that exclusively relies on GitHub Actions CI doing the job:

This basically means that if you look at each step of every CI job and replicate it locally - you should be able to get the same result locally. So if - for whatever reason - our CI stops working (say ASF limits our processing time and we have no money for "self-hosted" runners in AWS) - we will be able to replicate (slower and more painfully) what is now happening in CI - manually.

  • secondly - for processes that are likely to fail for whatever reason, we describe manual processes in "step-by-step" guides, explaining a) why we might need to do it b) how to perform setup of the environment c) how to run it by a "human"

Those are the current processes described this way:

So - whatever we do, we have to keep the "manual path" working as a backup plan.

Using hardware is a bit problematic because you have to have two machines (ARM and AMD) handy and connected - yes, it can be done using the cloud (of course) but - ideally, the fallback option we have is to use a local machine of one of the PMC members to do all the stuff above - so that we do not rely on GH Actions or even an AWS account being available. That's why emulation is going to stay - I think - as a backup plan.


Great know-how! :) Thanks for the detailed answer! Looking at it from this perspective, it totally makes sense to keep it as an option.

.. code-block:: bash

    cd eks
    tfenv install 1.5.3


It's not a deal breaker, just a small nice-to-have. We can include a .terraform-version file to manage the version, so we don't need to check the README. We can then call tfenv install to install the correct version without specifying it. This can be done at a later stage as well, after this one is merged.
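For reference, a minimal sketch of that setup, assuming the version currently pinned in the README (1.5.3):

# eks/.terraform-version - tfenv reads this file automatically
echo "1.5.3" > .terraform-version

# from the eks directory, no version argument is needed anymore:
tfenv install      # installs the version pinned in .terraform-version
terraform init     # the tfenv shim resolves the pinned version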

Member Author

I pinned the version; I will update the README since we don't need to install it manually anymore.
