[New Scheduler] Run scheduler #5194


Merged (12 commits) on Feb 11, 2022
52 changes: 52 additions & 0 deletions ansible/README.md
@@ -148,6 +148,58 @@ ansible-playbook -i environments/$ENVIRONMENT prereq.yml

**Hint:** During playbook execution, the `TASK [prereq : check for pip]` step can show as failed. This is expected when pip is not yet installed; the playbook then moves on and installs pip on the target machines.

### [Optional] Enable the new scheduler

You can enable the new OpenWhisk scheduler.
This deploys one additional component, the "scheduler", along with etcd.

#### Configure service providers for the scheduler
> **Contributor:** Will there be any guide on setting up etcd?
>
> **Member Author:** If you enable the scheduler by configuring `enable_scheduler: true`, it will be automatically deployed by this:
> https://github.com/apache/openwhisk/pull/5194/files#diff-2356bb62c87e471ef37b7973eb51e82282ef1131ee7ab4b62d909102de96967cR23

You can update service providers for the scheduler as follows.

**common/scala/src/main/resources**
```
whisk.spi {
  ArtifactStoreProvider = org.apache.openwhisk.core.database.CouchDbStoreProvider
  ActivationStoreProvider = org.apache.openwhisk.core.database.ArtifactActivationStoreProvider
  MessagingProvider = org.apache.openwhisk.connector.kafka.KafkaMessagingProvider
  ContainerFactoryProvider = org.apache.openwhisk.core.containerpool.docker.DockerContainerFactoryProvider
  LogStoreProvider = org.apache.openwhisk.core.containerpool.logging.DockerToActivationLogStoreProvider
  LoadBalancerProvider = org.apache.openwhisk.core.loadBalancer.FPCPoolBalancer
  EntitlementSpiProvider = org.apache.openwhisk.core.entitlement.FPCEntitlementProvider
  AuthenticationDirectiveProvider = org.apache.openwhisk.core.controller.BasicAuthenticationDirective
  InvokerProvider = org.apache.openwhisk.core.invoker.FPCInvokerReactive
  InvokerServerProvider = org.apache.openwhisk.core.invoker.FPCInvokerServer
  DurationCheckerProvider = org.apache.openwhisk.core.scheduler.queue.ElasticSearchDurationCheckerProvider
}
.
.
.
```

> **Contributor (@bdoyle0182, Feb 3, 2022),** on `whisk.spi {`: Shouldn't a provider for the scheduler db be required here, i.e. an ETCDProvider?
>
> **Member Author:** Do we have an ETCDProvider? If we need to use a key-value store other than etcd, we can introduce that layer too. But for now, I believe the code is sticking with etcd.

> **Contributor,** on the `ArtifactStoreProvider` line: MongoDB is technically supported now too, right? It's just a documentation example, so not a big deal.
>
> **Member Author:** Yes, I think so; there is no specific dependency on CouchDB. But I haven't tried it with MongoDB.
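Each entry in the `whisk.spi` block binds an OpenWhisk extension point to a fully qualified implementation class, so swapping a provider is a configuration change rather than a code change. Conceptually it is a name-to-class registry; here is a minimal Python sketch of the lookup idea (the mechanism is simplified, only the binding names are taken from the block above):

```python
# Simplified sketch of SPI-style resolution: each extension point name maps
# to exactly one configured implementation class (subset of whisk.spi above).
SPI_BINDINGS = {
    "LoadBalancerProvider": "org.apache.openwhisk.core.loadBalancer.FPCPoolBalancer",
    "InvokerProvider": "org.apache.openwhisk.core.invoker.FPCInvokerReactive",
    "InvokerServerProvider": "org.apache.openwhisk.core.invoker.FPCInvokerServer",
}

def resolve(spi_name: str) -> str:
    # Return the configured implementation class for an extension point.
    return SPI_BINDINGS[spi_name]

print(resolve("LoadBalancerProvider"))  # -> org.apache.openwhisk.core.loadBalancer.FPCPoolBalancer
```

In the real system the resolved class is instantiated via reflection at startup; this sketch only illustrates the binding table itself.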

#### Enable the scheduler
- Make sure you enable the scheduler by configuring `scheduler_enable`.

**ansible/environments/local/group_vars**
```yaml
scheduler_enable: true
```

#### [Optional] Enable ElasticSearch Activation Store
When you use the new scheduler, it is recommended to use ElasticSearch as an activation store.

**ansible/environments/local/group_vars**
```yaml
db_activation_backend: ElasticSearch
elastic_cluster_name: <your elasticsearch cluster name>
elastic_protocol: <your elasticsearch protocol>
elastic_index_pattern: <your elasticsearch index pattern>
elastic_base_volume: <your elasticsearch volume directory>
elastic_username: <your elasticsearch username>
elastic_password: <your elasticsearch password>
```

> **Contributor:** Can we use an external Elasticsearch?
>
> **Member Author:** Yes. It works exactly the same way CouchDB does with OpenWhisk.

You can also refer to this guide to [deploy OpenWhisk using ElasticSearch](https://github.com/apache/openwhisk/blob/master/ansible/README.md#using-elasticsearch-to-store-activations).

### Deploying Using CouchDB
- Make sure your `db_local.ini` file is [set up for](#setup) CouchDB, then execute:

6 changes: 6 additions & 0 deletions ansible/environments/local/hosts.j2.ini
@@ -27,6 +27,12 @@ invoker0 ansible_host=172.17.0.1 ansible_connection=local
invoker1 ansible_host=172.17.0.1 ansible_connection=local
{% endif %}

[schedulers]
scheduler0 ansible_host=172.17.0.1 ansible_connection=local
{% if mode is defined and 'HA' in mode %}
scheduler1 ansible_host=172.17.0.1 ansible_connection=local
{% endif %}

; db group is only used if db.provider is CouchDB
[db]
172.17.0.1 ansible_host=172.17.0.1 ansible_connection=local
62 changes: 60 additions & 2 deletions ansible/group_vars/all
@@ -127,6 +127,8 @@ jmx:
rmiBasePortController: 16000
basePortInvoker: 17000
rmiBasePortInvoker: 18000
basePortScheduler: 21000
rmiBasePortScheduler: 22000
user: "{{ jmxuser | default('jmxuser') }}"
pass: "{{ jmxuser | default('jmxpass') }}"
jvmCommonArgs: "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.password.file=/home/owuser/jmxremote.password -Dcom.sun.management.jmxremote.access.file=/home/owuser/jmxremote.access"
@@ -221,6 +223,8 @@ invoker:
keystore:
password: "{{ invoker_keystore_password | default('openwhisk') }}"
name: "{{ __invoker_ssl_keyPrefix }}openwhisk-keystore.p12"
container:
creationMaxPeek: "{{ container_creation_max_peek | default(500) }}"
reactiveSpi: "{{ invokerReactive_spi | default('') }}"
serverSpi: "{{ invokerServer_spi | default('') }}"

@@ -278,6 +282,9 @@ db:
invoker:
user: "{{ db_invoker_user | default(lookup('ini', 'db_username section=invoker file={{ playbook_dir }}/db_local.ini')) }}"
pass: "{{ db_invoker_pass | default(lookup('ini', 'db_password section=invoker file={{ playbook_dir }}/db_local.ini')) }}"
scheduler:
user: "{{ db_scheduler_user | default(lookup('ini', 'db_username section=scheduler file={{ playbook_dir }}/db_local.ini')) }}"
pass: "{{ db_scheduler_pass | default(lookup('ini', 'db_password section=scheduler file={{ playbook_dir }}/db_local.ini')) }}"
artifact_store:
backend: "{{ db_artifact_backend | default('CouchDB') }}"
activation_store:
@@ -435,8 +442,9 @@ metrics:

user_events: "{{ user_events_enabled | default(false) | lower }}"

durationChecker:
timeWindow: "{{ duration_checker_time_window | default('1 d') }}"
zeroDowntimeDeployment:
enabled: "{{ zerodowntime_deployment_switch | default(true) }}"
solution: "{{ zerodowntime_deployment_solution | default('apicall') }}"
> **Contributor:** It seems `apicall` is not readable here; how about using `scroll` or another more readable word? In any case, it means a rolling deployment. Another deployment solution is `half`, which means a small blue/green deployment.
>
> **Member Author:** Currently, this configuration does not take effect due to some missing parts. I believe we can address what you mentioned in a subsequent PR when we introduce the feature for zero-downtime deployment.
>
> **Contributor:** Got it.


etcd:
version: "{{ etcd_version | default('v3.4.0') }}"
@@ -463,13 +471,63 @@ etcd_connect_string: "{% set ret = [] %}\
{% endfor %}\
{{ ret | join(',') }}"
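The `etcd_connect_string` Jinja expression above (shown truncated) builds one `host:port` entry per etcd node and joins them with commas. A minimal Python sketch of the same idea, with an illustrative host list and the conventional etcd client port (neither is taken from a real inventory):

```python
# Sketch of etcd_connect_string: one "host:port" entry per etcd node,
# comma-joined. Hosts and port are illustrative assumptions.
def etcd_connect_string(hosts, client_port=2379):
    return ",".join(f"{host}:{client_port}" for host in hosts)

print(etcd_connect_string(["172.17.0.1"]))  # -> 172.17.0.1:2379
```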


__scheduler_blackbox_fraction: 0.10

watcher:
eventNotificationDelayMs: "{{ watcher_notification_delay | default('5000 ms') }}"

durationChecker:
timeWindow: "{{ duration_checker_time_window | default('1 d') }}"

enable_scheduler: "{{ scheduler_enable | default(false) }}"

scheduler:
protocol: "{{ scheduler_protocol | default('http') }}"
dir:
become: "{{ scheduler_dir_become | default(false) }}"
confdir: "{{ config_root_dir }}/scheduler"
basePort: 14001
grpc:
basePort: 13001
tls: "{{ scheduler_grpc_tls | default(false) }}"
maxPeek: "{{ scheduler_max_peek | default(128) }}"
heap: "{{ scheduler_heap | default('2g') }}"
arguments: "{{ scheduler_arguments | default('') }}"
instances: "{{ groups['schedulers'] | length }}"
username: "{{ scheduler_username | default('scheduler.user') }}"
password: "{{ scheduler_password | default('scheduler.pass') }}"
akka:
provider: cluster
cluster:
basePort: 25520
host: "{{ groups['schedulers'] | map('extract', hostvars, 'ansible_host') | list }}"
bindPort: 3551
# at this moment all schedulers are seed nodes
seedNodes: "{{ groups['schedulers'] | map('extract', hostvars, 'ansible_host') | list }}"
loglevel: "{{ scheduler_loglevel | default(whisk_loglevel) | default('INFO') }}"
extraEnv: "{{ scheduler_extraEnv | default({}) }}"
dataManagementService:
retryInterval: "{{ scheduler_dataManagementService_retryInterval | default('1 second') }}"
inProgressJobRetentionSecond: "{{ scheduler_inProgressJobRetentionSecond | default('20 seconds') }}"
managedFraction: "{{ scheduler_managed_fraction | default(1.0 - (scheduler_blackbox_fraction | default(__scheduler_blackbox_fraction))) }}"
blackboxFraction: "{{ scheduler_blackbox_fraction | default(__scheduler_blackbox_fraction) }}"
queueManager:
maxSchedulingTime: "{{ scheduler_maxSchedulingTime | default('20 second') }}"
maxRetriesToGetQueue: "{{ scheduler_maxRetriesToGetQueue | default(13) }}"
queue:
  # Running-state timeout: if no activation arrives while the queue is Running, the queue state changes from Running to Idle and the decision algorithm actor is deleted
idleGrace: "{{ scheduler_queue_idleGrace | default('20 seconds') }}"
> **Contributor:** I thought you said the default was to remove an idle queue after 24 hours?
>
> **Member Author:** I don't think 24 hours fits all cases; I mentioned it because we are using 24 hours. I am OK with changing the default value, but I believe each downstream will also choose a proper timeout other than the default.
>
> The default configuration here means a queue becomes idle after 20 seconds and is terminated after another 20 seconds. In the idle state, the queue runs no decision-making to add containers, and it gets back to running whenever a new activation comes in.

  # Idle-state timeout: if no activation arrives while the queue is Idle, the queue state changes from Idle to Removed
stopGrace: "{{ scheduler_queue_stopGrace | default('20 seconds') }}"
  # Paused-state timeout: if no activation arrives while the queue is Paused, the queue state changes from Paused to Removed
flushGrace: "{{ scheduler_queue_flushGrace | default('60 seconds') }}"
gracefulShutdownTimeout: "{{ scheduler_queue_gracefulShutdownTimeout | default('5 seconds') }}"
maxRetentionSize: "{{ scheduler_queue_maxRetentionSize | default(10000) }}"
maxRetentionMs: "{{ scheduler_queue_maxRetentionMs | default(60000) }}"
maxBlackboxRetentionMs: "{{ scheduler_queue_maxBlackboxRetentionMs | default(300000) }}"
throttlingFraction: "{{ scheduler_queue_throttlingFraction | default(0.9) }}"
durationBufferSize: "{{ scheduler_queue_durationBufferSize | default(10) }}"
deployment_ignore_error: "{{ scheduler_deployment_ignore_error | default('False') }}"
dataManagementService:
retryInterval: "{{ scheduler_dataManagementService_retryInterval | default('1 second') }}"
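The three grace periods above govern the queue's lifecycle transitions. A small Python sketch of the state machine as described by the inline comments and the author's reply (semantics inferred from this discussion, not from the actual Scala implementation):

```python
# Assumed queue lifecycle, inferred from the grace-period comments above:
#   Running --(no activations for idleGrace)--> Idle
#   Idle    --(no activations for stopGrace)--> Removed
#   Paused  --(no activations for flushGrace)-> Removed
def next_state(state, idle_seconds, idle_grace=20, stop_grace=20, flush_grace=60):
    if state == "Running" and idle_seconds >= idle_grace:
        return "Idle"
    if state == "Idle" and idle_seconds >= stop_grace:
        return "Removed"
    if state == "Paused" and idle_seconds >= flush_grace:
        return "Removed"
    # Any new activation resets the idle clock, so the state is unchanged.
    return state

print(next_state("Running", 25))  # -> Idle
```

With the defaults, an untouched queue is Idle after 20 seconds and Removed after a further 20 seconds, matching the author's "20 + 20" explanation above.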
5 changes: 5 additions & 0 deletions ansible/openwhisk.yml
@@ -20,12 +20,17 @@
# playbook (currently cloudant.yml or couchdb.yml).
# It assumes that wipe.yml has been deployed at least once.

- import_playbook: etcd.yml
when: enable_scheduler

- import_playbook: kafka.yml
when: not lean

- import_playbook: controller.yml

- import_playbook: scheduler.yml
when: enable_scheduler

- import_playbook: invoker.yml
when: not lean

15 changes: 15 additions & 0 deletions ansible/roles/controller/tasks/deploy.yml
@@ -309,6 +309,21 @@
env: "{{ env | combine(mongodb_env) }}"
when: db.artifact_store.backend == "MongoDB"

- name: setup scheduler env
set_fact:
scheduler_env:
"CONFIG_whisk_etcd_hosts": "{{ etcd_connect_string }}"
"CONFIG_whisk_etcd_lease_timeout": "{{ etcd.lease.timeout }}"
"CONFIG_whisk_etcd_pool_threads": "{{ etcd.pool_threads }}"
"CONFIG_whisk_scheduler_grpc_tls": "{{ scheduler.grpc.tls | default('false') | lower }}"
"CONFIG_whisk_scheduler_maxPeek": "{{ scheduler.maxPeek }}"
when: enable_scheduler

- name: merge scheduler env
set_fact:
env: "{{ env | combine(scheduler_env) }}"
when: enable_scheduler

- name: populate volumes for controller
set_fact:
controller_volumes:
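The `setup scheduler env` / `merge scheduler env` tasks above overlay scheduler-specific variables onto the container's existing environment. Ansible's `combine` filter behaves like a shallow dict merge in which later keys win; a Python sketch with illustrative keys (only `CONFIG_whisk_etcd_hosts` is taken from the tasks above):

```python
# Sketch of "env | combine(scheduler_env)": a shallow dict merge, later keys win.
env = {"CONFIG_whisk_loglevel": "INFO"}                          # illustrative
scheduler_env = {"CONFIG_whisk_etcd_hosts": "172.17.0.1:2379"}   # illustrative value
merged = {**env, **scheduler_env}  # scheduler_env entries override env on collision
print(sorted(merged))
```

Because the merge runs only `when: enable_scheduler`, deployments without the scheduler keep their environment untouched.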
15 changes: 15 additions & 0 deletions ansible/roles/invoker/tasks/deploy.yml
@@ -328,6 +328,21 @@
env: "{{ env | combine(mongodb_env) }}"
when: db.artifact_store.backend == "MongoDB"

- name: setup scheduler env
set_fact:
scheduler_env:
"CONFIG_whisk_etcd_hosts": "{{ etcd_connect_string }}"
"CONFIG_whisk_etcd_lease_timeout": "{{ etcd.lease.timeout }}"
"CONFIG_whisk_etcd_pool_threads": "{{ etcd.pool_threads }}"
"CONFIG_whisk_scheduler_dataManagementService_retryInterval": "{{ scheduler.dataManagementService.retryInterval }}"
"CONFIG_whisk_invoker_containerCreation_maxPeek": "{{ invoker.container.creationMaxPeek }}"
when: enable_scheduler

- name: merge scheduler env
set_fact:
env: "{{ env | combine(scheduler_env) }}"
when: enable_scheduler

- name: include plugins
include_tasks: "{{ inv_item }}.yml"
with_items: "{{ invoker_plugins | default([]) }}"
24 changes: 24 additions & 0 deletions ansible/roles/schedulers/tasks/clean.yml
@@ -0,0 +1,24 @@
---
# Remove scheduler containers.

- name: get scheduler name
set_fact:
scheduler_name: "{{ name_prefix ~ host_group.index(inventory_hostname) }}"

- name: remove scheduler
docker_container:
name: "{{ scheduler_name }}"
state: absent
ignore_errors: "True"

- name: remove scheduler log directory
file:
path: "{{ whisk_logs_dir }}/{{ scheduler_name }}"
state: absent
become: "{{ logs.dir.become }}"

- name: remove scheduler conf directory
file:
path: "{{ scheduler.confdir }}/{{ scheduler_name }}"
state: absent
become: "{{ scheduler.dir.become }}"
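The `get scheduler name` task derives the container name from the host's position within its inventory group, so `scheduler0`, `scheduler1`, and so on line up with the `[schedulers]` entries in the hosts file. A Python sketch of that Jinja expression (inventory values illustrative):

```python
# Sketch of "{{ name_prefix ~ host_group.index(inventory_hostname) }}":
# container name = prefix concatenated with the host's index in its group.
def scheduler_name(name_prefix, host_group, inventory_hostname):
    return f"{name_prefix}{host_group.index(inventory_hostname)}"

print(scheduler_name("scheduler", ["172.17.0.1"], "172.17.0.1"))  # -> scheduler0
```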