[New Scheduler] Add FPCPoolBalancer #5158


Closed
style95 wants to merge 5 commits into master from add-fpc-pool-balancer

Conversation


@style95 style95 commented Sep 15, 2021

Description

This adds the FPCPoolBalancer, the load balancer for the new scheduler.

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@style95 style95 changed the title from "Add FPCPoolBalancer" to "[New Scheduler] Add FPCPoolBalancer" on Sep 15, 2021

codecov-commenter commented Sep 15, 2021

Codecov Report

Merging #5158 (0449b4d) into master (9633043) will decrease coverage by 3.78%.
The diff coverage is 31.08%.


@@            Coverage Diff             @@
##           master    #5158      +/-   ##
==========================================
- Coverage   45.50%   41.72%   -3.78%     
==========================================
  Files         234      235       +1     
  Lines       13389    13655     +266     
  Branches      551      546       -5     
==========================================
- Hits         6092     5698     -394     
- Misses       7297     7957     +660     
Impacted Files | Coverage Δ
...rg/apache/openwhisk/common/AverageRingBuffer.scala | 27.27% <ø> (ø)
...enwhisk/core/loadBalancer/CommonLoadBalancer.scala | 0.00% <0.00%> (ø)
...che/openwhisk/core/loadBalancer/LoadBalancer.scala | 0.00% <0.00%> (ø)
...e/loadBalancer/ShardingContainerPoolBalancer.scala | 0.00% <ø> (ø)
...e/openwhisk/core/scheduler/queue/MemoryQueue.scala | 83.73% <ø> (ø)
.../openwhisk/core/loadBalancer/FPCPoolBalancer.scala | 33.06% <33.06%> (ø)
...pache/openwhisk/http/LenientSprayJsonSupport.scala | 0.00% <0.00%> (-100.00%) ⬇️
...g/apache/openwhisk/common/ResizableSemaphore.scala | 0.00% <0.00%> (-88.47%) ⬇️
...ache/openwhisk/utils/ExecutionContextFactory.scala | 7.69% <0.00%> (-76.93%) ⬇️
...n/scala/org/apache/openwhisk/utils/JsHelpers.scala | 0.00% <0.00%> (-55.56%) ⬇️
... and 54 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data


style95 commented Dec 29, 2021

I found that the environment variable for unit tests is not properly exported.

travis@travis-job-f528e04c-ee92-449c-a66c-0c4d61bfde0b:~/build/apache/openwhisk$ env | grep GRADLE
GRADLE_PROJS_SKIP=

So some tests that are supposed to be excluded from the unit-test run are being executed.
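
Note that the variable is present but empty (GRADLE_PROJS_SKIP=), which is not the same as unset. The following is a minimal, illustrative Scala sketch of how an empty-but-set variable can defeat a skip guard; it is not the actual Gradle logic, only the failure pattern.

object SkipGuardSketch extends App {
  // On the Travis VM above, GRADLE_PROJS_SKIP is exported with an empty value,
  // so sys.env.get returns Some("") rather than None.
  val skip: Option[String] = sys.env.get("GRADLE_PROJS_SKIP")

  // An empty string parses to an empty skip list, so no project is excluded
  // and tests that should be skipped end up running.
  val projectsToSkip: List[String] =
    skip.filter(_.nonEmpty).map(_.split(",").toList).getOrElse(Nil)

  println(projectsToSkip) // List()
}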

@style95
Copy link
Member Author

style95 commented Jan 3, 2022

Getting this from the performance tests.

[2022-01-03T06:57:40.498Z] [ERROR] [#tid_throughput-async_110] [ShardingContainerPoolBalancer] failed to schedule activation d63b3cc02d2e4004bb3cc02d2eb004cc, action 'guest/async_110@0.0.1' (managed), ns 'guest' - invokers to use: Map(Unresponsive -> 1)
[2022-01-03T06:57:40.498Z] [ERROR] [#tid_throughput-async_110] [ActionsApi] [POST] failed in loadbalancer: No invokers available
[2022-01-03T06:57:40.518Z] [ERROR] [#tid_throughput-async_110] [ShardingContainerPoolBalancer] failed to schedule activation 5036a212b31946c3b6a212b31996c391, action 'guest/async_110@0.0.1' (managed), ns 'guest' - invokers to use: Map(Unresponsive -> 1)
[2022-01-03T06:57:40.518Z] [WARN] [#tid_throughput-async_110] [ActionsApi] No invokers available [marker:controller_loadbalancer_error:48:0]
[2022-01-03T06:57:40.518Z] [WARN] [#tid_throughput-async_110] [ActionsApi] No invokers available [marker:controller_blockingActivation_error:48:0]
[2022-01-03T06:57:40.518Z] [ERROR] [#tid_throughput-async_110] [ActionsApi] [POST] failed in loadbalancer: No invokers available
[2022-01-03T06:57:40.518Z] [INFO] [#tid_throughput-async_110] [BasicHttpService] [marker:http_post.503_counter:48:48]
[2022-01-03T06:57:45.344Z] [ERROR] null
java.util.NoSuchElementException: null
        at scala.collection.concurrent.TrieMap.apply(TrieMap.scala:833)
        at org.apache.openwhisk.common.NestedSemaphore.releaseConcurrent(NestedSemaphore.scala:103)
        at org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer.$anonfun$releaseInvoker$1(ShardingContainerPoolBalancer.scala:329)
        at org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer.$anonfun$releaseInvoker$1$adapted(ShardingContainerPoolBalancer.scala:329)
        at org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer$$Lambda$2777/0000000050138B60.apply(Unknown Source)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer.releaseInvoker(ShardingContainerPoolBalancer.scala:329)
        at org.apache.openwhisk.core.loadBalancer.CommonLoadBalancer.$anonfun$processCompletion$4(CommonLoadBalancer.scala:309)
        at org.apache.openwhisk.core.loadBalancer.CommonLoadBalancer.$anonfun$processCompletion$4$adapted(CommonLoadBalancer.scala:309)
        at org.apache.openwhisk.core.loadBalancer.CommonLoadBalancer$$Lambda$2776/0000000050137AF0.apply(Unknown Source)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.openwhisk.core.loadBalancer.CommonLoadBalancer.processCompletion(CommonLoadBalancer.scala:309)
        at org.apache.openwhisk.core.loadBalancer.CommonLoadBalancer.$anonfun$setupActivation$6(CommonLoadBalancer.scala:165)
        at org.apache.openwhisk.core.loadBalancer.CommonLoadBalancer$$Lambda$2750/000000004C12E6C0.apply$mcV$sp(Unknown Source)
        at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:475)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:48)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
[2022-01-03T06:57:45.364Z] [ERROR] null
java.util.NoSuchElementException: null
        (identical stack trace to the one above)
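
The frame at the top of both traces is scala.collection.concurrent.TrieMap.apply, which throws NoSuchElementException when the key is absent; NestedSemaphore.releaseConcurrent appears to hit this when releasing an entry that was never acquired or was already removed. A minimal standalone sketch of that failure mode (illustrative, not OpenWhisk code):

import java.util.NoSuchElementException
import scala.collection.concurrent.TrieMap

object TrieMapApplySketch extends App {
  val perActionSemaphores = TrieMap[String, Int]("guest/async_110" -> 2)

  // apply works for a present key.
  println(perActionSemaphores("guest/async_110")) // 2

  // apply on a missing key throws java.util.NoSuchElementException,
  // matching the TrieMap.apply frame at the top of the stack traces.
  try perActionSemaphores("guest/missing")
  catch { case e: NoSuchElementException => println(s"caught: $e") }

  // A defensive alternative returns an Option instead of throwing.
  println(perActionSemaphores.get("guest/missing")) // None
}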


style95 commented Jan 3, 2022

It seems that connection errors against runtime containers are happening during the performance tests.

[2022-01-03T06:53:49.830Z] [INFO] [#tid_throughput-noop_110] [CouchDbRestStore]  [marker:database_saveDocument_finish:17:10]
[2022-01-03T06:53:57.275Z] [INFO] [#tid_throughput-noop_110] [DockerContainer] running result: ConnectionError(akka.stream.StreamTcpException: The connection closed with error: Connection reset by peer) [marker:invoker_activationRun_finish:12067:18]
[2022-01-03T06:53:57.278Z] [INFO] [#tid_throughput-noop_110] [CouchDbRestStore] [PUT] 'whisk_local_activations' saving document: 'id: guest/cd38a697126f4b0fb8a697126f8b0f05, rev: null' [marker:database_saveDocument_start:12070]
[2022-01-03T06:53:57.279Z] [INFO] [#tid_sid_dbBatcher] [CouchDbRestStore] 'whisk_local_activations' saving 1 documents [marker:database_saveDocumentBulk_start:883170]
[2022-01-03T06:53:57.280Z] [INFO] [#tid_throughput-noop_110] [MessagingActiveAck] posted combined of activation cd38a697126f4b0fb8a697126f8b0f05
[2022-01-03T06:53:57.283Z] [ERROR] [#tid_sid_unknown] [ContainerProxy] Failed during use of warm container Some(ContainerId(beab41a139fd6b49e07712a61973d614554003b83df04149f112e8613ce696c1)), queued activations will be resent.


style95 commented Jan 3, 2022

I suspect the Travis VM cannot handle the Docker workload for some reason.
I can see many error logs in the syslog complaining about missing veth devices.

07:33:10 localhost systemd-udevd[78934]: veth426f4b3: Failed to get link config: No such device


style95 commented Jan 3, 2022

I found that a small portion of requests were already failing in another test run, even though that run was marked PASSED.
There were around 16 activations with non-2xx/3xx responses.

$ TERM=dumb ./tests/performance/wrk_tests/throughput.sh "https://172.17.0.1:10001" "$(cat ansible/files/auth.guest)" ./tests/performance/preparation/actions/async.js 100 110 2 2m
Creating action async_110
{"annotations":[{"key":"provide-api-key","value":false},{"key":"exec","value":"nodejs:10"}],"exec":{"kind":"nodejs:10","code":"/*\n * Licensed to the Apache Software Foundation (ASF) under one or more\n * contributor license agreements.  See the NOTICE file distributed with\n * this work for additional information regarding copyright ownership.\n * The ASF licenses this file to You under the Apache License, Version 2.0\n * (the \"License\"); you may not use this file except in compliance with\n * the License.  You may obtain a copy of the License at\n *\n *     http://www.apache.org/licenses/LICENSE-2.0\n *\n * Unless required by applicable law or agreed to in writing, software\n * distributed under the License is distributed on an \"AS IS\" BASIS,\n * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n * See the License for the specific language governing permissions and\n * limitations under the License.\n */\n\nfunction main() {\n  return new Promise(function (resolve, reject) {\n    setTimeout(function () {\n      resolve({done: true});\n    }, 175);\n  })\n}","binary":false},"limits":{"concurrency":110,"logs":10,"memory":256,"timeout":60000},"name":"async_110","namespace":"guest","parameters":[],"publish":false,"updated":1612280953586,"version":"0.0.1"}Running async_110 once to assert an intact system
{"activationId":"83d84193c4414def984193c4418def45","annotations":[{"key":"path","value":"guest/async_110"},{"key":"waitTime","value":1469},{"key":"kind","value":"nodejs:10"},{"key":"timeout","value":false},{"key":"limits","value":{"concurrency":110,"logs":10,"memory":256,"timeout":60000}},{"key":"initTime","value":374}],"duration":555,"end":1612280955799,"logs":[],"name":"async_110","namespace":"guest","publish":false,"response":{"result":{"done":true},"size":13,"status":"success","success":true},"start":1612280955244,"subject":"guest","version":"0.0.1"}Running 2m test @ https://172.17.0.1:10001/api/v1/namespaces/_/actions/async_110?blocking=true
  2 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   610.72ms  332.99ms   6.35s    95.03%
    Req/Sec    84.71     58.64   370.00     71.60%
  Latency Distribution
     50%  560.17ms
     75%  671.19ms
     90%  801.36ms
     99%    1.51s 
  19156 requests in 2.00m, 17.49MB read
  Socket errors: connect 0, read 0, write 0, timeout 16
  Non-2xx or 3xx responses: 16
Requests/sec:    159.51
Transfer/sec:    149.11KB

https://app.travis-ci.com/github/apache/openwhisk/jobs/479539830

Since we are running lots of components plus the wrk Docker client on a VM with 2 cores and 8 GB of memory, I feel it doesn't have enough resources to run the performance tests.


style95 commented Jan 3, 2022

I suspect this issue has come to the fore because of the Docker version upgrade.
Previously we were using Docker 19.03.8, but the version on the Travis VMs is now 20.10.7.

docker version
Client:
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.13.8
 Git commit:        afacb8b7f0
 Built:             Fri Dec 18 12:15:19 2020
 OS/Arch:           linux/amd64
 Experimental:      false
Server:
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.8
  Git commit:       afacb8b7f0
  Built:            Fri Dec  4 23:02:49 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.3.3-0ubuntu2.2
  GitCommit:        
 runc:
  Version:          spec: 1.0.1-dev
  GitCommit:        
 docker-init:
  Version:          0.18.0
  GitCommit:        

https://app.travis-ci.com/github/apache/openwhisk/jobs/479539830

travis@travis-job-9229aa98-c0d2-4c78-bedd-9b283d2b4c16:~/build/apache/openwhisk$ docker version
Client:
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.8
 Git commit:        20.10.7-0ubuntu5~20.04.2
 Built:             Mon Nov  1 00:34:17 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.8
  Git commit:       20.10.7-0ubuntu5~20.04.2
  Built:            Fri Oct 22 00:45:53 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.5.5-0ubuntu3~20.04.1
  GitCommit:
 runc:
  Version:          1.0.1-0ubuntu2~20.04.1
  GitCommit:
 docker-init:
  Version:          0.19.0
  GitCommit:

@style95 style95 force-pushed the add-fpc-pool-balancer branch from 5defb4b to 7f71e79 on January 3, 2022 at 10:49
style95 added a commit to style95/openwhisk that referenced this pull request Jan 4, 2022
.travis.yml Outdated
- OPENWHISK_HOST="172.17.0.1" CONNECTIONS="100" REQUESTS_PER_SEC="1" ./gradlew gatlingRun-org.apache.openwhisk.ApiV1Simulation
- OPENWHISK_HOST="172.17.0.1" MEAN_RESPONSE_TIME="1000" API_KEY="$(cat ansible/files/auth.guest)" EXCLUDED_KINDS="python:default,java:default,swift:default" PAUSE_BETWEEN_INVOKES="100" ./gradlew gatlingRun-org.apache.openwhisk.LatencySimulation
- OPENWHISK_HOST="172.17.0.1" API_KEY="$(cat ansible/files/auth.guest)" CONNECTIONS="100" REQUESTS_PER_SEC="1" ./gradlew gatlingRun-org.apache.openwhisk.BlockingInvokeOneActionSimulation
- OPENWHISK_HOST="172.17.0.1" API_KEY="$(cat ansible/files/auth.guest)" CONNECTIONS="100" REQUESTS_PER_SEC="1" ASYNC="true" ./gradlew gatlingRun-org.apache.openwhisk.BlockingInvokeOneActionSimulation
# The following configuration does not make much sense. But we do not have enough users. But it's good to verify, that the test is still working.
- OPENWHISK_HOST="172.17.0.1" USERS="1" REQUESTS_PER_SEC="1" ./gradlew gatlingRun-org.apache.openwhisk.ColdBlockingInvokeSimulation
- TERM=dumb ./tests/performance/wrk_tests/latency.sh "https://172.17.0.1:10001" "$(cat ansible/files/auth.guest)" ./tests/performance/preparation/actions/noop.js 2m
style95 (Member Author) commented:

I changed the order to make the performance test pass.
But that does not mean the performance test is fully working; the wrk tests still complain about activations with non-2xx/3xx responses.

IMHO, we need a more stable environment to run these tests, so I opened a new issue: #5190.

.travis.yml Outdated
@@ -85,19 +85,20 @@ jobs:
- ./tools/travis/checkAndUploadLogs.sh standalone
name: "Standalone Tests"
- script:
- sed -i "s@pause-grace = 50 milliseconds@pause-grace = 10 seconds@g" ./core/invoker/src/main/resources/application.conf
style95 (Member Author) commented:

This is to minimize overly frequent status changes (PAUSED <-> RUNNING).
It looks like the system cannot forward activations to runtime containers in time due to limited resources.
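
For context, pause-grace is the idle interval the invoker waits after an activation before pausing a warm container, so raising it from 50 milliseconds to 10 seconds lets containers survive short gaps between activations instead of flapping. Below is a minimal sketch of how the two HOCON values parse, assuming the Typesafe Config library used to read application.conf:

import com.typesafe.config.ConfigFactory

object PauseGraceSketch extends App {
  // The two settings swapped by the sed command in the diff above.
  val before = ConfigFactory.parseString("pause-grace = 50 milliseconds")
  val after  = ConfigFactory.parseString("pause-grace = 10 seconds")

  // With a 50 ms grace, any gap between activations longer than 50 ms pauses
  // the container; on an overloaded Travis VM that means constant churn.
  println(before.getDuration("pause-grace").toMillis) // 50
  println(after.getDuration("pause-grace").toMillis)  // 10000
}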

@style95 style95 force-pushed the add-fpc-pool-balancer branch 2 times, most recently from 50bf6a9 to 983c8d2 on January 4, 2022 at 08:47
@style95 style95 force-pushed the add-fpc-pool-balancer branch from 983c8d2 to 3fdd05a on January 4, 2022 at 08:48
@style95 style95 closed this Jan 5, 2022