Configure Github Actions Runner Controller on EKS #59
Other things to note: our current runners all have a docker user login (…).

What is the "baseline" cost involved in running the EKS cluster and the ARC when no jobs are running?
terraform/eks/eks.tf (Outdated)

    default_nodes = {
      name = "default"

      instance_types = ["t4g.medium"]
Our current runners are running on any of the following instance types -- whichever has the cheapest spot price:
r6a.2xlarge
r5.2xlarge
r5ad.2xlarge
r5n.2xlarge
r5b.2xlarge
r5a.2xlarge
r6i.2xlarge
And we mount things as tmpfs using ~50% of the RAM (I thought):
- [tmpfs, /var/lib/docker, tmpfs, "defaults,noatime,size=85%"]
- [tmpfs, /home/runner/actions-runner/_work, tmpfs, "defaults,noatime"]
I can use this list of instances with a different size for each runner deployment (replacing 2xlarge with medium and large for small and medium runners). WDYT?
In hindsight, I think we should try to run the tests without applying the tmpfs mounts.
I believe the speed-up and stability gain for running tests is not as big as I originally thought. Having everything on tmpfs has a great impact when you produce a lot of small files (i.e. when you compile a lot of code). But when you just run stuff, or run docker where the files are mostly written once and there are not many thousands of them being created and updated, having a tmpfs does not change much -- it actually even increases memory usage, because files on the filesystem are cached by the POSIX FS layer anyway. So maybe that tmpfs is not really needed to get a lot of the speed-up coming from multiple processors running tests in parallel using the same prepared images.
After we get a "plain" solution working, we could then apply tmpfs, see the difference, and compare.
Yes, I was planning to add it before merging. For now I'm using an ECR to store the runner images to avoid pulling them from dockerhub, but we have other images in the workflow which we pull from dockerhub. I already reached my 6-hour limit when I ran the tests multiple times.
The ARC and the other infra pods are running on the default node group. However, the support for multiple types of runners (to run small CI jobs on a medium node instead of a 2xlarge), and the scalability we get in K8S, will help us to reduce the total cost.
@ashb I made some changes in this PR, but there are a few more things to discuss:
All these instances are AMD; one of the goals was migrating completely to ARM (Graviton). According to your list, we can use r6g, WDYT?

I created a new var to easily define the instance types:

    variable "runners_node_types" {
      description = "Node type for the runners"
      type        = list(string)
      default     = ["t4g"]
    }

However, I wonder if using different types with different CPU/RAM ratios is a good idea: when we update a job and it starts to hit the memory limit, we will not be able to detect the problem easily, because sometimes it will run on a node with more RAM. Personally I don't like mixing node types, but if there is a benefit for the price and the availability of spot instances, I'm OK with that.
Not ready yet, I will check the current runners and try to do the same thing. The runner considers the version …, and when I added a conf to disable auto-update, the container entered a crash loop.

Is there a reason not to upgrade the Airflow runner to 2.307.1, or is it just a lack of free time? If it's related to your time, I can try it.
    runnerScaleSets:
      arc-small-amd:
        minRunners: 0
        maxRunners: 30
        size: small
        arch: x64
      arc-medium-amd:
        minRunners: 0
        maxRunners: 30
        size: medium
        arch: x64
      arc-large-amd:
        minRunners: 0
        maxRunners: 30
        size: large
        arch: x64
      arc-small-arm:
        minRunners: 0
        maxRunners: 30
        size: small
        arch: arm64
      arc-medium-arm:
        minRunners: 0
        maxRunners: 30
        size: medium
        arch: arm64
      arc-large-arm:
        minRunners: 0
        maxRunners: 30
        size: large
        arch: arm64
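For reference -- assuming each key above ends up as the name of a runner scale set (ARC uses the scale set / Helm release name as the runs-on label), a CI job would then target one of them with a single label, roughly like this hypothetical workflow snippet:

    # Hypothetical snippet -- the label must match the actual scale set name.
    jobs:
      static-checks:
        runs-on: arc-small-amd
        steps:
          - uses: actions/checkout@v4
          - run: echo "running on an ARC small x64 runner"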
This totals up to a possibly large number of runners -- I worry about the possible cost given we don't have infinite credits.
I think this is something we will need to plan separately. Personally, I think we only need small and big runners: small runners (even one CPU) for very small and fast jobs where we have no parallelisation, and big runners (even 16 CPU and a lot of RAM) where we have parallelisation (various tests maybe). So something like 30 big and 30 small should be quite OK IMHO to start with. Besides, for now the small ones should not even be needed -- because we can mix-and-match public and self-hosted runners in a single workflow, we could use public runners for all the "small" jobs and self-hosted for all the "big" jobs.
But having the flexibility in the future is useful.
And we will need far fewer ARM runners -- they are currently disabled, but unless we want to enable tests for ARM (which we could now), we need them only for image building.
This totals up to a possibly large number of runners -- I worry about the possible cost given we don't have infinite credits.
Running 30 nodes for 3 hours costs the same as running 90 nodes for 1 hour. I'm not saying that we have to scale to 1k, but we shouldn't worry about the max number of nodes/pods; we just need to correctly configure the scaling of ARC and the nodes.
And we will need far fewer ARM runners -- they are currently disabled, but unless we want to enable tests for ARM (which we could now), we need them only for image building.
In the first phase, we will only use AMD nodes/runners, but I plan to switch all the CI jobs to ARM runners (except the AMD docker image build), where ARM nodes are faster and cheaper.
For example, for a node with 32 GB of RAM and 8 vCPUs:
- m7a.2xlarge --> 0.1698$/h
- m7g.2xlarge --> 0.1202$/h (30% cheaper)
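For a rough sense of scale (illustrative numbers only): at 1,000 runner-hours per month that price difference is about $170 vs $120, i.e. roughly $50 saved per 1,000 runner-hours, before any spot discount.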
Interesting. I think eventually we should run tests on both ARM and AMD nodes -- this was actually a pre-requisite for me to make the ARM variant of the images "officially" supported. Currently we explain that ARM images should only be used for development, but I always wanted to run the complete set of tests on both AMD and ARM, to be able to say "yeah -- both platforms are equally well tested".
Running 30 nodes for 3 hours costs the same as running 90 nodes for 1 hour. I'm not saying that we have to scale to 1k, but we shouldn't worry about the max number of nodes/pods; we just need to correctly configure the scaling of ARC and the nodes.
If we consider cost alone -- yes. But there are other factors in favour of using a smaller number of bigger machines instead of a bigger number of smaller ones. On the number of workers -- indeed, when we run our own nodes, it does not matter much whether we have 8 x 1 CPU or 1 x 8 CPU; cost-wise it does not matter for raw test execution.
And I agree it would be really nice if we could get rid of all the parallelism introduced by my scripts that run tests and other things in parallel on a single instance. In that case, for example, the logging output is much more readable (Breeze has a whole suite of utils that handles multiplexing and "folding" of the output from parallel processes, but due to the size of the output the GitHub interface is sometimes very sluggish and not convenient for looking at it).
However, there are certain benefits of running "big" instances in our case:

- For public runners, INFRA sets limits on how many jobs each workflow should run in parallel (because it has a limited number of public runners donated by GitHub that are shared between all projects). This does not apply to our self-hosted runners of course, but when we are using public runners with 2 CPUs, running 2 tests in parallel will speed up the tests without increasing the number of FT (full-time) runners we use. If you look at https://infra-reports.apache.org/#ghactions -- Airflow is at 10 FT runners now (which is still well within the limit of 25 FT), mostly because we have parallelism implemented.

- Running tests in parallel on big machines can decrease the overall cost of the "cluster" a lot. For tests and other docker-bound tasks (such as documentation building) it takes some time to "warm up" the machine -- i.e. install Breeze and all necessary dependencies, clean up the machine, and download all the images to be used (the Breeze CI image and the database image needed during tests -- Postgres/MySQL). This takes ~2-3 minutes in total, but more importantly it takes a lot of space. Some of that is parallelised (like downloading an image), so it runs faster on bigger instances, which means the elapsed time per "test type" is smaller when we run more parallel tests. But more important is the overhead and space/memory used for the "common" image that is reused between all the parallel instances. Currently the tests are grouped onto an instance per "python version" and "database", which means the CI image and the database image are downloaded once and used by all tests that run in parallel on that big instance.

So when each "test type" takes (say) 8 minutes, if we run it on a 16-CPU machine we spend the 2-3 minutes of initialization only once -- all 16 parallel tests run on the pre-warmed machine using that single initialized, "warmed" docker instance, which is about 4 GB of disk space and cache in memory per running instance. If instead we run 16 x 1-CPU machines, then we need 16 x 4 GB of disk space and memory to cache it. That means our cluster can effectively handle much more load, because each of the many small VMs would use more disk space/memory. Similarly, on one big machine all the parallel tests share the same kernel, which uses the same memory -- so overall, having a big machine running parallel tests on multiple docker instances using the same images should be more optimised.
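To make that comparison concrete with the numbers above (indicative only): 16 parallel test types on 16 x 1-CPU runners repeat the 2-3 minute warm-up 16 times and keep roughly 16 x 4 GB = 64 GB of duplicated images and cache in the cluster, while one 16-CPU runner pays the warm-up once and shares a single ~4 GB image between all 16 parallel test processes.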
But we can test it once we have ARC in place, because there might be other factors (like imperfect parallelism, which has its limits) that could change the calculation -- see https://en.wikipedia.org/wiki/Gustafson%27s_law -- parallelism always has a limit because not everything can be parallelised and some things are always serialized, so there is a theoretical limit to the speed-up (and cost optimisation) we can get by parallelising more.
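For reference, that serialized-fraction limit is usually written in its Amdahl's-law form (Gustafson's law is the scaled-workload counterpart): with p the parallelisable fraction and N the number of CPUs, the speed-up is S(N) = 1 / ((1 - p) + p/N), so even with p = 0.9 it can never exceed 1 / (1 - p) = 10x, no matter how many CPUs run in parallel.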
    eks_managed_node_groups = {

      default_nodes = {
        name = "default"
What are these nodes used for?
For ARC, the autoscaler, the proxy, and all the K8S service pods...
For the runner nodes, I added a taint as protection, so no pod can be scheduled on them without an explicit toleration.
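For illustration, a minimal sketch of what such a tainted node group could look like with the terraform-aws-modules/eks module (assuming that's what eks.tf uses here; the node group name, taint key, and values below are assumptions, not the exact code in this PR):

    # Sketch only -- node group name, taint key and values are assumptions.
    runners_nodes = {
      name           = "runners"
      instance_types = [for t in var.runners_node_types : "${t}.${var.large_runners_node_size}"]

      taints = {
        dedicated = {
          key    = "purpose"
          value  = "github-runner"
          effect = "NO_SCHEDULE"
        }
      }
    }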
variable "small_runners_node_size" { | ||
description = "Node size for the small runners" | ||
type = string | ||
default = "medium" | ||
} | ||
|
||
variable "small_runners_x64_node_size" { | ||
# This is a separate variable because the r series doesn't have a medium size | ||
description = "Node size for the small runners" | ||
type = string | ||
default = "large" | ||
} | ||
|
||
variable "medium_runners_node_size" { | ||
description = "Node size for the medium runners" | ||
type = string | ||
default = "xlarge" | ||
} | ||
|
||
variable "large_runners_node_size" { | ||
description = "Node size for the large runners" | ||
type = string | ||
default = "2xlarge" | ||
} |
Nit: I'm not actually sure having this as variables makes it clearer -- to me (and this is just a nit) this variable use makes it harder to understand:
instance_types = [for node_type in var.runners_node_types: "${node_type}.${var.large_runners_node_size}"]
vs
instance_types = [for node_type in var.runners_node_types: "${node_type}.large"]
My goal was to avoid touching the code to change these configurations, as we may need to change it according to the performance of our CI jobs. I will try to simplify it.
Airflow_CI.png (Outdated)
The "need to scale if possible" is handled by the Kube cluster autoscaller right, ARC just creates pods and leaves the rest up to K8s?
There are two levels of scaling: pod scaling for the runner scale set deployment, and node scaling to schedule the pods.
The two happen sequentially: ARC detects a queued job and asks the runner deployment to run it; if there are no available runners and the number of runners in the deployment is smaller than the configured max, it creates a new runner pod. The rest is handled by the K8S scheduler and autoscaler.
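As a sketch of how the runner pods then land on the tainted runner nodes and trigger node scale-up (the keys and values here are assumptions and must match whatever taints/labels the node groups actually carry), the scale set's pod template would add a matching node selector and toleration, roughly:

    # Sketch only -- values must match the node group taint/labels.
    template:
      spec:
        nodeSelector:
          purpose: github-runner
        tolerations:
          - key: purpose
            operator: Equal
            value: github-runner
            effect: NoSchedule
        containers:
          - name: runner
            image: ghcr.io/actions/actions-runner:latest
            command: ["/home/runner/run.sh"]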
Yeah. That's the same as what our KEDA autoscaling does -- while KEDA can increase the number of workers in the celery queue, it's the Kubernetes Autoscaler that must set up or tear down nodes if there are not enough / too many resources to run the number of workers we need.
The diagram in the PR is outdated; I switched to Autoscaling Runner Scale Sets mode, as recommended by GitHub. For more info:
And here is the commit for the switch: 1b82f07 (if you are not familiar with helmfile and helm value layering, …).
The PR is not ready yet; the runner docker image needs some extra packages before it is ready to run all our CI jobs.
    Terraform: Airflow infrastructure as code
    ========================================

    In this folder, you will find the Terraform code to deploy the Airflow infrastructure on AWS.
Future note: I could help with automating Terraform deployment via GitHub Actions
runner/Dockerfile (Outdated)

    RUN apt-get update \
        && apt-get install -y --no-install-recommends \
        ca-certificates curl nodejs npm wget unzip vim git jq build-essential netcat \
The changes look good to me!
I skimmed through all the CI jobs and their scripts to ease your job. The CI jobs already have steps that install their own dependencies quite well, so other than some commands like curl, git, netcat, etc., this addition should be enough. I am not sure we need wget though -- please correct me if I am wrong, but I haven't seen any usage of wget other than in a couple of Dockerfiles, which install it themselves. Maybe we shouldn't install it here, so we don't have to deal with its vulnerabilities.
Of course, it's always good to double-check. I hope it helps!
    - ca-certificates curl nodejs npm wget unzip vim git jq build-essential netcat \
    + ca-certificates curl nodejs npm wget unzip vim git jq build-essential netcat openssh-client \
We will likely need more things. See the changes I made yesterday to make "public" runners work for building our images (I brought back the workflow that allows releasing images to dockerhub using emulation):
https://github.com/apache/airflow/pull/41962/files
We need buildx and regctl for that to work -- and when emulation is used we also need to bring in qemu and the other tools needed for docker emulation.
I temporarily brought the workflow back yesterday for @utkarsharma2's and @kaxil's convenience -- so that they do not have to run it manually -- with emulation it takes about an hour to build the ARM layer for the production image (single Python version). But we will still need regctl and buildx installed, for example (and we will likely find a few other things missing on our runners).
BTW, for the multi-platform images and hardware support (I am not sure if you remember our conversation @hussein-awala some time ago), we will likely need to apply a slightly more complex scheme if we want to use ARC ARM runners -- because (unlike when we started ARM instances from within the Amazon VM) we need to coordinate two running instances to build the layers separately and create the manifest. This seems to be easier now, and there are ready-to-use recipes and actions for it -- I captured it in apache/airflow#41935 (comment)
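As a starting point, here is a minimal sketch of how buildx and regctl could be added to the runner image (the versions, architectures, and download URLs below are assumptions that should be pinned and verified; the qemu/binfmt setup for emulation is not shown):

    # Sketch only -- versions and paths are assumptions.
    ARG TARGETARCH
    ARG BUILDX_VERSION=v0.16.2
    ARG REGCTL_VERSION=v0.7.1
    RUN mkdir -p /home/runner/.docker/cli-plugins \
        && curl -fsSL -o /home/runner/.docker/cli-plugins/docker-buildx \
           "https://github.com/docker/buildx/releases/download/${BUILDX_VERSION}/buildx-${BUILDX_VERSION}.linux-${TARGETARCH}" \
        && curl -fsSL -o /usr/local/bin/regctl \
           "https://github.com/regclient/regclient/releases/download/${REGCTL_VERSION}/regctl-linux-${TARGETARCH}" \
        && chmod +x /home/runner/.docker/cli-plugins/docker-buildx /usr/local/bin/regctl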
Thanks for pointing it out in detail! Agreed, we need to include these and maybe more. apache/airflow#41935 (comment) is pretty clear. Are we planning to remove emulation after we start using ARC ARM runners? Just to understand the whole scope of apache/airflow#41935.
Short answer: I think emulation as an option should remain.
Long answer:
Here (and this is already part of the "CI/CD" knowledge transfer :) ) - let me explain why I think it should.
I prefer -- if possible -- to have a manual backup for all the "automated" processes we have in CI. Generally, my "guiding principle" for any kind of CI work like this is to NEVER rely 100% on a particular CI doing the job exclusively. I simply hate to be put in a situation where "something else" fails and the only answer is "we have to wait for them to fix it". This is OK for temporary disruptions, but when it comes to processes like releasing, which we often do under time pressure and when users expect the new release to come out quickly, I prefer not to rely on 3rd parties as much as possible.
We actually saw that in the last few weeks when the CI workflow to release RC packages was broken -- we could only release Airflow without having to find a "proper" solution immediately because we had a manual process, where you could either use hardware, if you happen to have two machines, or emulation (which @kaxil used), even if it means it takes hours instead of minutes. We had a plan B, and a plan C, which involved not only someone (like me) who had the right hardware setup, but also someone with just a local machine and good networking who can "fire-and-forget" a process that runs for an hour rather than 10 minutes, without any special environment configuration.
BTW, in this case -- even though I could have helped with the hardware setup, my AMD linux workstation at home ACTUALLY BROKE last week and I only got it back on Friday :D...
So ALWAYS HAVE PLAN B (and C, and D sometimes)... That lets the CI team sleep better :)
In our case we have manual processes for all the things that CI jobs currently do automatically, and no single part of the process relies exclusively on GitHub Actions CI doing the job:
- first of all - the commands we run in CI are not "GitHub Actions"-exclusive (except some environment setup) - most of the important actions are breeze commands that can be run manually, locally. This is the main reason why we have breeze, actually - and it's nicely captured in this "Architecture Decision Record": https://github.com/apache/airflow/blob/main/dev/breeze/doc/adr/0002-implement-standalone-python-command.md
This basically means that if you look at each step of every CI job and replicate it locally, you should be able to get the same result. So if - for whatever reason - our CI stops working (say ASF limits our processing time and we have no money for "self-hosted" runners in AWS) - we will be able to replicate (slower and more painfully) what is now happening in CI - manually.
- secondly - for processes that are likely to fail for whatever reason, we describe the manual processes in "step-by-step" guides, explaining a) why we might need to do it, b) how to set up the environment, c) how a human runs it.
Those are the current processes described this way:
- https://github.com/apache/airflow/blob/main/dev/MANUALLY_BUILDING_IMAGES.md
- https://github.com/apache/airflow/blob/main/dev/MANUALLY_GENERATING_IMAGE_CACHE_AND_CONSTRAINTS.md
So - whatever we do, we have to keep the "manual path" working as a backup plan.
Using hardware is a bit problematic because you have to have two machines (ARM and AMD) handy and connected - yes, it can be done using the cloud (of course) - but ideally the fallback option is to use the local machine of one of the PMC members to do all the stuff above, so that we do not rely on GH Actions or even an AWS account being available. That's why emulation is going to stay, I think, as a backup plan.
Great know-how! :) Thanks for the detailed answer! When looking from this perspective, it totally makes sense to keep it as an option.
    .. code-block:: bash

        cd eks
        tfenv install 1.5.3
It's not a deal breaker, just a small nice-to-have. We can include a .terraform-version file to manage the version, so we don't have to check the README. We can then call tfenv install to install the correct version without specifying it. This can also be done at a later stage, after this one is merged.
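For illustration, the suggested flow might look roughly like this (a sketch; the path and the pinned version are assumptions):

    # Sketch: pin the Terraform version next to the code so tfenv picks it up automatically.
    echo "1.5.3" > terraform/eks/.terraform-version
    cd terraform/eks
    tfenv install   # reads .terraform-version, no explicit version needed
    tfenv use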
I pinned the version; I will update the README since we don't need to install it manually anymore.
This PR: