add diagnosis script #5683

Merged (14 commits) on May 6, 2025

hisImminence
Contributor

@hisImminence hisImminence commented May 1, 2025

Description

closes https://github.com/camunda/team-distribution/issues/495

This script automates the collection of logs and diagnostics from a Camunda Helm Chart deployment in a Kubernetes cluster. It gathers relevant information from the specified namespace and outputs it in a .zip file for easy sharing with the Camunda Support team.

Once this change is approved, I will backport it to the other versions.

Example output file: camunda-diagnostics-logs-20250501-082547.zip

Terminal output:

========================================
Camunda Diagnostics Collection Script
========================================
Namespace: immi-test
Output Directory: camunda-diagnostics-logs-20250501-082547
Current kubectl context: gke_camunda-distribution_europe-west1-b_distro-ci
========================================
Collecting resource information...
  - Collecting pod information (current state of all pods in the namespace).
  - Collecting cluster events (recent events in the namespace).
  - Collecting Persistent Volume Claims (PVCs) descriptions (storage claims in the namespace).
  - Collecting service information (list of services in the namespace).
  - Collecting detailed service descriptions (configuration of services).
  - Collecting endpoint information (list of endpoints in the namespace).
  - Collecting detailed endpoint descriptions (configuration of endpoints).
  - Collecting ingress descriptions (configuration of ingress resources).
  - Collecting config map information (configuration data stored in the namespace).
  - Collecting Persistent Volumes (PVs)
    - Collecting information for PV: pvc-0264b211-65ec-43f9-984d-99979ebd75d7
    - Collecting information for PV: pvc-5969ec94-d963-4fc0-ad08-a871df01af79
    - Collecting information for PV: pvc-4b1e1971-fd13-4159-9d43-a8efa2bb56fb
    - Collecting information for PV: pvc-4c4f92cf-1d05-4640-8820-7db2394b44f9
  - Collecting node information for nodes...
    - Collecting information for node: gke-distro-ci-workflow-preemptible02-ba8c70c7-bkbp
    - Collecting information for node: gke-distro-ci-workflow-preemptible02-ba8c70c7-cpvf
    - Collecting information for node: gke-distro-ci-workflow-preemptible02-ba8c70c7-gc6n
    - Collecting information for node: gke-distro-ci-workflow-preemptible02-ba8c70c7-nnbk
    - Collecting information for node: gke-distro-ci-workflow-preemptible02-ba8c70c7-pbkp
    - Collecting information for node: gke-distro-ci-workflow-spot02-001b92b6-2vqz
    - Collecting information for node: gke-distro-ci-workflow-spot02-001b92b6-5wjh
    - Collecting information for node: gke-distro-ci-workflow-spot02-001b92b6-8rbx
    - Collecting information for node: gke-distro-ci-workflow-spot02-001b92b6-hr4x
    - Collecting information for node: gke-distro-ci-workflow-spot02-001b92b6-mdxq
    - Collecting information for node: gke-distro-ci-workflow-spot02-001b92b6-szx5
    - Collecting information for node: gke-distro-ci-workflow-spot02-001b92b6-vs66
    - Collecting information for node: gke-distro-ci-workflow-spot02-001b92b6-wddz
  - Collecting logs and descriptions for each pod...
    - Collecting logs for pod: camunda-connectors-7fd4f6f6fd-lcls2
    - Collecting logs for pod: camunda-elasticsearch-master-0
    - Collecting logs for pod: camunda-identity-7bb4cc6dcc-cgbgl
    - Collecting logs for pod: camunda-keycloak-0
    - Collecting logs for pod: camunda-operate-d6b454f46-4nlmn
    - Collecting logs for pod: camunda-optimize-f99fdcd68-cns7q
    - Collecting logs for pod: camunda-postgresql-0
    - Collecting logs for pod: camunda-postgresql-web-modeler-0
    - Collecting logs for pod: camunda-tasklist-57765cd549-8cqdx
    - Collecting logs for pod: camunda-web-modeler-restapi-bf767b4b4-lmggt
    - Collecting logs for pod: camunda-web-modeler-webapp-97866d889-zm4xk
    - Collecting logs for pod: camunda-web-modeler-websockets-5cccddcf5f-lvlst
    - Collecting logs for pod: camunda-zeebe-0
    - Collecting logs for pod: camunda-zeebe-gateway-797bc6488f-qgkm7
All logs and descriptions collected.
Compressing collected diagnostics into camunda-diagnostics-logs-20250501-082547.zip...
Diagnostics collected and compressed into camunda-diagnostics-logs-20250501-082547.zip.
========================================
Diagnostics collection completed.
Please share the file 'camunda-diagnostics-logs-20250501-082547.zip' with the Camunda Support team.

To clean up the generated files and folder, run the following command:
  rm -rf camunda-diagnostics-logs-20250501-082547 camunda-diagnostics-logs-20250501-082547.zip
========================================
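
The per-pod loop at the end of the output above could be sketched roughly as follows. This is not the script from the PR: `collect_pod_logs`, its two arguments, and the one-file-per-pod layout are hypothetical choices made for illustration. Only `kubectl get pods` and `kubectl logs` are real commands.

```shell
# Hypothetical helper (not the PR's script): write one log file per pod.
# $1 = namespace, $2 = output directory (both names are assumptions).
collect_pod_logs() {
  namespace=$1
  out_dir=$2
  mkdir -p "$out_dir"
  # Ask for the pod name field directly instead of parsing the table output.
  for pod in $(kubectl get pods -n "$namespace" \
      -o custom-columns=":metadata.name" --no-headers); do
    echo "    - Collecting logs for pod: $pod"
    # Errors are written to the same file so a failing pod stays visible.
    kubectl logs "$pod" -n "$namespace" --all-containers=true \
      > "$out_dir/$pod.log" 2>&1
  done
}
```

A call such as `collect_pod_logs immi-test "camunda-diagnostics-logs-$(date +%Y%m%d-%H%M%S)"` would produce a directory name in the same timestamp format as the example output, which could then be compressed with a tool like `zip -r`.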

When should this change go live?

  • This is a bug fix, security concern, or something that needs urgent release support. (add bug or support label)
  • This is already available but undocumented and should be released within a week. (add available & undocumented label)
  • This is on a specific schedule and the assignee will coordinate a release with the Documentation team. (create draft PR and/or add hold label)
  • This is part of a scheduled alpha or minor. (add alpha or minor label)
  • There is no urgency with this change (add low prio label)

PR Checklist

  • My changes are for an upcoming minor release and are in the /docs directory (version 8.8).
  • My changes are for an already released minor and are in a /versioned_docs directory.

Contributor

github-actions bot commented May 1, 2025

👋 🤖 ✅ Looks like the changes were ported across versions, nice job! 🎉

You can read more about the versioning within our docs in our documentation guidelines.

@hisImminence hisImminence marked this pull request as ready for review May 1, 2025 11:36
@hisImminence hisImminence requested a review from a team May 1, 2025 11:36
@hisImminence hisImminence force-pushed the add-diagnostic-script branch from 24ce35a to dd43517 Compare May 1, 2025 22:07
@hisImminence hisImminence force-pushed the add-diagnostic-script branch from 7aeaec3 to 53dc549 Compare May 2, 2025 10:25
@akeller akeller added the deploy and component:self-managed labels May 2, 2025
@akeller
Contributor

akeller commented May 2, 2025

@hisImminence I added the deploy label for easier reviews. Please tag tech-writers as a reviewer when you are ready for us (I noticed a few issues with grammar/syntax).

@github-actions github-actions bot temporarily deployed to camunda-docs May 2, 2025 19:10 Destroyed
@hisImminence hisImminence force-pushed the add-diagnostic-script branch from 53dc549 to ada7984 Compare May 5, 2025 00:09
@gustavo-camunda
Contributor

Hi @hisImminence,

Looks good in general, thanks! I noticed that if a Pod has been restarted, the logic that iterates over nodes incorrectly picks up the restart age as a node name. For example, in the following scenario:

kubectl get pod -A -o wide

NAMESPACE            NAME                                                           READY   STATUS    RESTARTS        AGE     IP            NODE                                   NOMINATED NODE   READINESS GATES
camunda-platform     camunda-tasklist-79bb8c9d85-sx6cn                              1/1     Running   1 (2d21h ago)   2d21h   10.244.1.5    camunda-platform-local-worker          <none>           <none>

Then "2d21h" will be picked up as a node name. Script output:

  - Collecting node information:
    - Collecting information for node: 2d21h
Error from server (NotFound): nodes "2d21h" not found
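
The failure can be reproduced without a cluster. The sketch below assumes the script picked the node name by whitespace-separated field position; the PR conversation does not show the original line, so `awk '{print $8}'` is a guess at the kind of parsing involved. The first sample row is hypothetical (a pod with no restarts); the second is the row quoted above.

```shell
# Rows shaped like `kubectl get pod -A -o wide --no-headers`.
# Row 1 is a hypothetical pod without restarts; row 2 is from the report.
rows='camunda-platform camunda-zeebe-0 1/1 Running 0 2d21h 10.244.1.4 camunda-platform-local-worker <none> <none>
camunda-platform camunda-tasklist-79bb8c9d85-sx6cn 1/1 Running 1 (2d21h ago) 2d21h 10.244.1.5 camunda-platform-local-worker <none> <none>'

# Assumed parsing: the 8th field is the NODE column.
# "1 (2d21h ago)" adds two extra fields, so field 8 becomes the AGE value.
naive_nodes=$(printf '%s\n' "$rows" | awk '{print $8}' | sort -u)
printf '%s\n' "$naive_nodes"
# prints:
# 2d21h
# camunda-platform-local-worker
```

Because the RESTARTS value contains spaces once a pod has restarted, any fixed field index is unreliable for the columns to its right.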

@hisImminence hisImminence force-pushed the add-diagnostic-script branch from 05bbba8 to 17d0a03 Compare May 5, 2025 18:36
@hisImminence
Contributor Author

Great catch! I fixed it by selecting the node name column directly:
for node in $(kubectl get pods -n "$namespace" -o custom-columns=":spec.nodeName" --no-headers | sort | uniq); do
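
A slightly more defensive variant of that loop header is sketched below. `list_pod_nodes` is a hypothetical helper name, and the extra filter is an assumption on my part: as far as I know, `custom-columns` prints `<none>` for a pod that has not been scheduled to a node yet, which would otherwise be treated as a node name.

```shell
# Hypothetical helper: unique node names that host pods in namespace $1.
list_pod_nodes() {
  kubectl get pods -n "$1" -o custom-columns=":spec.nodeName" --no-headers |
    # Drop blank lines and the <none> placeholder of unscheduled pods.
    awk 'NF && $1 != "<none>" {print $1}' |
    sort -u
}
```

The loop would then read `for node in $(list_pod_nodes "$namespace"); do ... done`.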

@hisImminence hisImminence requested a review from akeller May 5, 2025 18:38
@akeller akeller moved this to 👀 In Review in Documentation Team May 5, 2025
@akeller akeller requested review from a team and removed request for akeller May 5, 2025 18:40
jessesimpson36 previously approved these changes May 5, 2025
Contributor

@jessesimpson36 jessesimpson36 left a comment

I'm good with these changes / tried them locally.

@github-actions github-actions bot temporarily deployed to camunda-docs May 5, 2025 19:21 Destroyed
mesellings previously approved these changes May 6, 2025
Contributor

@mesellings mesellings left a comment

Approved! Just a few comments/suggestions 🚀

@hisImminence hisImminence dismissed stale reviews from mesellings and jessesimpson36 via 0deb7dc May 6, 2025 17:52
@hisImminence
Contributor Author

Super! Thank you @mesellings - all your reviews made sense to me :)

@hisImminence hisImminence enabled auto-merge (squash) May 6, 2025 17:54
@github-actions github-actions bot temporarily deployed to camunda-docs May 6, 2025 18:01 Destroyed
@hisImminence
Contributor Author

P.S. I need one more review to unblock the merge.

@hisImminence hisImminence merged commit a2d8782 into main May 6, 2025
9 checks passed
@hisImminence hisImminence deleted the add-diagnostic-script branch May 6, 2025 19:12
@github-project-automation github-project-automation bot moved this from 👀 In Review to ✅ Done in Documentation Team May 6, 2025
Contributor

github-actions bot commented May 6, 2025

🧹 Preview environment for this PR has been torn down.

Labels
component:self-managed (Docs and issues related to Camunda Platform 8 Self-Managed)
deploy (Stand up a temporary docs site with this PR)
Projects
Status: ✅ Done
5 participants