Skip to content

Dashboards readiness probe #117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

Dashboards readiness probe #117

wants to merge 4 commits into from

Conversation

Sebor
Copy link
Contributor

@Sebor Sebor commented Nov 2, 2021

Description

Add k8s readiness probe to opensearch-dashboards helm chart

Related issues

Check List

  • [ x ] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Sebor
Copy link
Contributor Author

Sebor commented Nov 2, 2021

Hi @TheAlgo, @peterzhuamazon, @smlx !
Could you review this PR?

@peterzhuamazon
Copy link
Member

Hi @Sebor thanks for the PR could you please create an issue and link to this PR?
This would make it easier for people to check.

Thanks.

@Sebor
Copy link
Contributor Author

Sebor commented Nov 3, 2021

Hi @peterzhuamazon!
I've linked the issue. But what about tests? I think an opensearch server also should be deployed, otherwise opensearch-dashboards always will fail due to readiness probe

@peterzhuamazon
Copy link
Member

Hi @peterzhuamazon! I've linked the issue. But what about tests? I think an opensearch server also should be deployed, otherwise opensearch-dashboards always will fail due to readiness probe

Are you talking about your own testing?
Also I see LINT check failed again probably on version.

@Sebor
Copy link
Contributor Author

Sebor commented Nov 3, 2021

Hi @peterzhuamazon! I've linked the issue. But what about tests? I think an opensearch server also should be deployed, otherwise opensearch-dashboards always will fail due to readiness probe

Are you talking about your own testing? Also I see LINT check failed again probably on version.

No I mean LINT check here, in PR (Lint and Test Charts / lint-test (pull_request) ):

Error: INSTALLATION FAILED: timed out waiting for the condition
...
Warning    Unhealthy    pod/opensearch-dashboards-6z2migwhtu-8cbd64465-tgb98    spec.containers{dashboards}   kubelet, chart-testing-control-plane   Readiness probe failed: Error: Got HTTP code 503 but expected a 200 
...
{"type":"log","@timestamp":"2021-11-03T00:52:41Z","tags":["error","opensearch","data"],"pid":1,"message":"[ConnectionError]: getaddrinfo EAI_AGAIN opensearch-cluster-master opensearch-cluster-master:9200"}

@peterzhuamazon
Copy link
Member

Hi @peterzhuamazon! I've linked the issue. But what about tests? I think an opensearch server also should be deployed, otherwise opensearch-dashboards always will fail due to readiness probe

Are you talking about your own testing? Also I see LINT check failed again probably on version.

No I mean LINT check here, in PR (Lint and Test Charts / lint-test (pull_request) ):

Error: INSTALLATION FAILED: timed out waiting for the condition
...
Warning    Unhealthy    pod/opensearch-dashboards-6z2migwhtu-8cbd64465-tgb98    spec.containers{dashboards}   kubelet, chart-testing-control-plane   Readiness probe failed: Error: Got HTTP code 503 but expected a 200 
...
{"type":"log","@timestamp":"2021-11-03T00:52:41Z","tags":["error","opensearch","data"],"pid":1,"message":"[ConnectionError]: getaddrinfo EAI_AGAIN opensearch-cluster-master opensearch-cluster-master:9200"}

Seem like it. Good catch here.
Is it possible for you to add this functionalities in another PR?
Thanks.

@peterzhuamazon
Copy link
Member

@mprimeaux any inputs? ^^

@sastorsl
Copy link
Contributor

sastorsl commented Nov 3, 2021

@Sebor I was wondering, do we really want to log on to opensearch in the readiness check?
I not sure I see the need to start taking down Opensearch Dashboards if the backend is unavailable for a spell?

A simple call to the root (or base path) will give you a 302 (redirect to login), and it will at least verify that that Opensearch Dashboards is responding.
I think that depending on the backend makes things more complicated than needs be, and instead of fixing the backend you also have to struggle with Dashboards going down and up.

What are your thoughts on this?

@Sebor
Copy link
Contributor Author

Sebor commented Nov 4, 2021

Hi @sastorsl!
Readiness probe does not log on to opensearch. It checks dashboards only. And dashboards will always give 503 error without connection to opensearch.
You can test it locally:

  • run opensearch-dashboards container via docker:
    docker run -d -p 5601:5601 opensearchproject/opensearch-dashboards
  • do curl request:
    curl -v localhost:5601
    The answer will be something like:
    *   Trying 127.0.0.1:5601...
    * Connected to localhost (127.0.0.1) port 5601 (#0)
    > GET / HTTP/1.1
    > Host: localhost:5601
    > User-Agent: curl/7.76.1
    > Accept: */*
    >
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 503 Service Unavailable
    < retry-after: 30
    < content-type: text/html; charset=utf-8
    < cache-control: private, no-cache, no-store, must-revalidate
    < content-length: 45
    < Date: Thu, 04 Nov 2021 11:02:26 GMT
    < Connection: keep-alive
    <
    * Connection #0 to host localhost left intact
    OpenSearch Dashboards server is not ready yet
    
  • see docker logs:
    docker logs -f <container_name>
    You'll see a lot of messages:
    {"type":"log","@timestamp":"","tags":["error","opensearch","data"],"pid":1,"message":"[ConnectionError]: connect ECONNREFUSED 127.0.0.1:9200"}
    

That's why test fails. I.e. it is common behavior of dashboards and not readiness probe

@sastorsl
Copy link
Contributor

sastorsl commented Nov 4, 2021

@Sebor you are quite right.
I'll leave it in the open and let the record show, etc... :-)

@peterzhuamazon
Copy link
Member

@DandyDeveloper @TheAlgo any idea how to proceed with this PR? Thanks.

@DandyDeveloper
Copy link
Collaborator

DandyDeveloper commented Nov 11, 2021

I need to investigate and approve if things look good later today. Also, we'll need to pull in master to add the other tests.

@TheAlgo
Copy link
Member

TheAlgo commented Nov 11, 2021

Yes some investigation is required here to confirm the behaviour

@Sebor
Copy link
Contributor Author

Sebor commented Nov 11, 2021

I can propose "dirty" workaround: to add flag to disable\enable readiness probe in chart (disabled by default). For example in values.yaml:

readinessProbe:
  enabled: false

What do you think?

@DandyDeveloper
Copy link
Collaborator

@Sebor I don't like the dirty option here, because it's technically working around the issue.

I'm not entirely opposed to adding a enabled: false to the CI, but then you avoid testing this entirely, which defeats the purpose of the readiness probes.

Consider this is testing for readiness, a better option would be to include opensearch as a dependency in the workflow and deploy a single node cluster alongside the opensearch-dashboards, so we have a working stack and a "full test" of your readiness probes.

@peterzhuamazon
Copy link
Member

@Sebor any progress to follow up with what @DandyDeveloper has mentioned above?
Thanks.

@Sebor
Copy link
Contributor Author

Sebor commented Nov 17, 2021

Hi @peterzhuamazon!
I don't quite understand what is expected of me

@peterzhuamazon
Copy link
Member

Hi @peterzhuamazon! I don't quite understand what is expected of me

@DandyDeveloper could you clear up what is the next step that @Sebor can proceed with this PR?
It has been some time since we have this PR.

Thanks.

@DandyDeveloper
Copy link
Collaborator

@Sebor @peterzhuamazon I was hoping other opinions would weigh in.

I think we should be completely testing this otherwise we're going to get issues down the road.

To implement this properly, we need to have a proper CI mechanism that matches the default values. This means;

  • DIsable is by default in the values AND therefore the CI and include a note in the README the impact this has by enabling
    OR
  • We find a way to remove the dependency of Opensearch for the readiness in the Dashboards.

@TheAlgo
Copy link
Member

TheAlgo commented Nov 22, 2021

I feel it is difficult to remove the dependency of dashboards for readiness unless we don't want to go with the standard way to get the health for dashboards. We can find some other variant which can help determine readiness but that can be ambiguous and will require some POC to be done.

Just thinking out loud. What if we can in our CI first install OpenSearch and then do OpenSearch Dashboards everytime we have a change in OpenSearch Dashboards. Opinions?

@DandyDeveloper
Copy link
Collaborator

@TheAlgo I don't like adding a dependency of having Opensearch deployed, especially in the CI, for the Opensearch Dashboard tests to run. Its a simple solution, but we're worse off long term.

@sastorsl
Copy link
Contributor

Is there any resource which is available in OpenSearch Dashboards even if OpenSearch is not available? An image, favicon, html file, anything?

@Sebor
Copy link
Contributor Author

Sebor commented Nov 23, 2021

Is there any resource which is available in OpenSearch Dashboards even if OpenSearch is not available? An image, favicon, html file, anything?

Simple text only:

$ curl -vL 127.1:5601
*   Trying 127.0.0.1:5601...
* Connected to 127.0.0.1 (127.0.0.1) port 5601 (#0)
> GET / HTTP/1.1
> Host: 127.0.0.1:5601
> User-Agent: curl/7.77.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 503 Service Unavailable
< retry-after: 30
< content-type: text/html; charset=utf-8
< cache-control: private, no-cache, no-store, must-revalidate
< content-length: 45
< Date: Tue, 23 Nov 2021 14:29:43 GMT
< Connection: keep-alive
<
* Connection #0 to host 127.0.0.1 left intact
OpenSearch Dashboards server is not ready yet

and 503 HTTP response code

@peterzhuamazon
Copy link
Member

It seems this PR has been stuck for too long.
What would be a way for this to get approved?
@DandyDeveloper @TheAlgo

Thanks.

@DandyDeveloper
Copy link
Collaborator

@peterzhuamazon I don't have an alternative option to how this could be better at this point.

I think deeper considerations are needed on pulling the trigger on this. Adding the CI dependency is not a great idea to me.

Having an option to disable this in the CI also just hides the problem.

What about raising this on the engineering side to see if a better mechanism for readiness (unauthenticated) can be made available in Opensearch Dashboards, one that also doesn't depend on the upstream Opensearch to confirm readines?

@sastorsl
Copy link
Contributor

Repeating my question of last...

Is there any resource which is available in OpenSearch Dashboards even if OpenSearch is not available? An image, favicon, html file, anything?

@sastorsl
Copy link
Contributor

sastorsl commented Dec 13, 2021

I attempted to call various resources such as /favicons/favicon.ico, etc and also got only HTTP response 503.

curl -f -v http://localhost:5601/bootstrap.js
curl -f -v http://localhost:5601/favicon.ico
curl -f -v http://localhost:5601/favicons/favicon.ico
curl -f -v http://localhost:5601/app/home
curl -f -v http://localhost:5601/auth/
curl -f -v http://localhost:5601/auth/openid/
curl -f -v http://localhost:5601/auth/openid/login/

But there is one thing, and that is the port is listening, and actually determines that a HTTP response code 503 is a valid response.

When you connect to an invalid port curl returns errorcode 7 Failed to connect to host.
When connecting to port 5601 (default port) with curl ... you get HTTP response 503, but returncode 0 or returncode 22 with curl -f ....

It is highly inconclusive - BUT there is something listening if you get at least a return code 0 from curl.
It is a start.

However I agree with @DandyDeveloper , this is really something the opensearch-dashboards team should look into or answer.
Present one alternative URL where the application returns 200, while anything else requires a backend probe.

localhost:5601/health  # Should return 200 as long as the application is up
localhost:5601/<anything-else>  # Should only return 200 (or 30x, 40x) if the backend is available

@TheAlgo
Copy link
Member

TheAlgo commented Dec 13, 2021

I attempted to call various resources such as /favicons/favicon.ico, etc and also got only HTTP response 503.

curl -f -v http://localhost:5601/bootstrap.js
curl -f -v http://localhost:5601/favicon.ico
curl -f -v http://localhost:5601/favicons/favicon.ico
curl -f -v http://localhost:5601/app/home
curl -f -v http://localhost:5601/auth/
curl -f -v http://localhost:5601/auth/openid/
curl -f -v http://localhost:5601/auth/openid/login/

But there is one thing, and that is the port is listening, and actually determines that a 503 is a valid response.

When you connect to an invalid port curl returns errorcode 7 Failed to connect to host and returncode 0. When connecting to port 5601 (default port) with curl ... you get return code 0 and with curl -f ... you get a returncode 22.

It is highly inconclusive - BUT there is something listening if you get at least a return code 0 from curl. It is a start.

But I agree with @DandyDeveloper , this is really something the opensearch-dashboards team should look into or answer. Present one alternative URL where the application returns 200, while anything else requires a backend probe.

localhost:5601/health  # Should return 200 as long as the application is up
localhost:5601/<anything-else>  # Should only return 200 (or 30x, 40x) if the backend is available

@CEHENKLE Can someone from Dashboards team give some guidance here?

@peterzhuamazon
Copy link
Member

I attempted to call various resources such as /favicons/favicon.ico, etc and also got only HTTP response 503.

curl -f -v http://localhost:5601/bootstrap.js
curl -f -v http://localhost:5601/favicon.ico
curl -f -v http://localhost:5601/favicons/favicon.ico
curl -f -v http://localhost:5601/app/home
curl -f -v http://localhost:5601/auth/
curl -f -v http://localhost:5601/auth/openid/
curl -f -v http://localhost:5601/auth/openid/login/

But there is one thing, and that is the port is listening, and actually determines that a 503 is a valid response.
When you connect to an invalid port curl returns errorcode 7 Failed to connect to host and returncode 0. When connecting to port 5601 (default port) with curl ... you get return code 0 and with curl -f ... you get a returncode 22.
It is highly inconclusive - BUT there is something listening if you get at least a return code 0 from curl. It is a start.
But I agree with @DandyDeveloper , this is really something the opensearch-dashboards team should look into or answer. Present one alternative URL where the application returns 200, while anything else requires a backend probe.

localhost:5601/health  # Should return 200 as long as the application is up
localhost:5601/<anything-else>  # Should only return 200 (or 30x, 40x) if the backend is available

@CEHENKLE Can someone from Dashboards team give some guidance here?

@tmarkley @kavilla Could use some help here for dashboards.
Thanks.

@kavilla
Copy link
Member

kavilla commented Jan 3, 2022

Hey! OpenSearch Dashboards depends on OpenSearch to do anything so with that said having OpenSearch running is a major part of the startup process for OpenSearch Dashboards. If there is no OpenSearch then it won't try to serve up anything. So arguably, 503 is the correct code and implementing something so that it returns a 200 seems no different than:

readinessProbe:
  enabled: false

I think the correct solution would be adding the dependency.

@sastorsl
Copy link
Contributor

sastorsl commented Jan 4, 2022

Hey! OpenSearch Dashboards depends on OpenSearch to do anything so with that said having OpenSearch running is a major part of the startup process for OpenSearch Dashboards. If there is no OpenSearch then it won't try to serve up anything. So arguably, 503 is the correct code and implementing something so that it returns a 200 seems no different than:

To actually be able to do something OpenSearch Dashboards requires Opensearch. Still, there is a need to determine if OpenSearch Dashboard itself is ready.

"The troops are ready, just waiting for the general, Sir!"

@tmarkley
Copy link

tmarkley commented Jan 4, 2022

Still, there is a need to determine if OpenSearch Dashboard itself is ready.

@sastorsl As the OpenSearch Dashboards code base exists now, the server fails to start and is not ready in any way if OpenSearch is not running. Do we need to further discuss that point in this PR? Maybe we should be asking what the purpose of this readiness probe is?

As a maintainer of Dashboards I don't believe this is as trivial as adding a "health" API because it wouldn't properly represent the state of the application. If anything we should be checking the health of OpenSearch before Dashboards is started.

@sastorsl
Copy link
Contributor

sastorsl commented Jan 4, 2022

@tmarkley there are comments as to why a minimum test of OpenSearch Dasboards is both desirable and necessary but I'll try to summarize.

  • CI/CD test just OpenSearch Dashboards and see if it is broken.
  • Able to deploy OpenSearch Dashboards in an unordered sequence and leave it open until OpenSearch is available.
  • OpenSearch Dashboards not going down if OpenSearch goes down.

OpenSearch Dashboards requires the backend / OpenSearch to function properly, but still it is a standalone application and should be able to simply report that it is up but that is is missing it's backend.

OpenSearch Dashboards going down when OpenSearch goes down also adds to what must be restarted / checked.

When automating deployment it makes sense to have separate deployment pipelines for OpenSearch and OpenSearch Dashboards since there are two helm charts and they receive updates at uneven intervals. This also means that at first deploy Dashboards might be ready first - and probably is, but then it goes down and up until OpenSearch comes up which needlessly stresses systems and also adds to what ops have to follow up / search for errors in.

These are real life experiences in my case, and I would guess for others as well.

@tmarkley
Copy link

tmarkley commented Jan 5, 2022

@sastorsl Thanks for that information, that helps clarify a few things for me. I think these are reasonable requests, but they're going to require some discussion about the path forward for OpenSearch Dashboards. Dashboards is a standalone application is some ways, but historically it's always been coupled to ES/OpenSearch. I don't immediately know the scope required to change this status quo, but again I don't think it's trivial. Do you mind opening an Issue in the OpenSearch Dashboards repo so that we can discuss the options, gauge community interest, and determine the scope of changes that might be required to reach this desired state?

@TheAlgo
Copy link
Member

TheAlgo commented Jan 5, 2022

@tmarkley Just thinking out loud. At the end OpenSearch Dashboards is a Node JS Server so ideally there should be a way out to check the health of it.

This thing might be helpful in creating Kubernetes operator as well I feel 🤔

@tmarkley
Copy link

tmarkley commented Jan 5, 2022

Just thinking out loud. At the end OpenSearch Dashboards is a Node JS Server so ideally there should be a way out to check the health of it.

Yeah, I agree that it's a reasonable thing to expect. A method for that just doesn't currently exist.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enhancement][opensearch-dashboards] Add k8s readiness probe
7 participants