KF1.7 Applications not Syncing #1762


Closed
wg102 opened this issue Jul 10, 2023 · 5 comments · Fixed by StatCan/aaw-kubeflow-manifests#373

wg102 (Contributor) commented Jul 10, 2023

KServe stuck in "Progressing" -- solved

Original ticket where the issue was found: #1752
Getting ImagePullBackOff, probably because Gatekeeper is blocking the image 'kserve/kserve-controller:v0.10.0'... that said, the error does say "Back-off restarting failed container".
Getting an "unable to parse ingress config json" error from the manager container (full log below).
We are currently using kserve/kserve-controller:v0.8.0 in prod, hence I don't think Gatekeeper is the reason it doesn't work.

Looking at the pod logs, we see:

{"level":"error","ts":1688561752.6363223,"logger":"entrypoint","msg":"unable to get ingress config.","error":"unable to parse ingress config json: invalid character '}' looking for beginning of object key string","stacktrace":"runtime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

According to an article on Stack Overflow, this is most likely caused by a malformed JSON file.
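
One way to confirm is to pull the raw ingress config out of the cluster and run it through a JSON parser. A minimal sketch, assuming KServe keeps this under the ingress key of its inferenceservice-config ConfigMap (the kubeflow namespace here is an assumption; it may be kserve depending on the install):

# Dump the ingress entry from the KServe config and let jq report any syntax error
kubectl -n kubeflow get configmap inferenceservice-config \
  -o jsonpath='{.data.ingress}' | jq .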

The other application is:

Monitoring-system stuck at Degraded Health

The monitoring-system one became degraded. It's the Elasticsearch resource in monitoring-system, and it's degraded because the health/status is yellow. Have we encountered this before, and how did we fix it? (I can't seem to find any issues in our repos that mention it aside from our new one.)

Jose-Matsuda (Contributor) commented Jul 10, 2023

With monitoring-system stuck at degraded health, if you click on the degraded app it says "ElasticSearch Cluster status is yellow, check the status of indices, replicas and shards". So while yes, it's a problem, I'm not sure it's 100% related to the upgrade.

I can't remember if it was degraded before / when we first upgraded it; I didn't think it was, but I may be mistaken.

Maybe this is a case where it kind of dies during some cluster operation.

wg102 (Contributor, Author) commented Jul 11, 2023

KServe stuck at Progressing:
(screenshot)

The issue was indeed a malformed file: the ingress config had one comma too many. Once that was edited in the live manifest, the application is out of sync (since there was a manual edit) but Healthy.
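
For reference, a trailing comma like the one in the hypothetical snippet below (the key names are illustrative, not our actual manifest) is exactly what produces the "invalid character '}' looking for beginning of object key string" error from Go's JSON parser:

{
  "ingressGateway": "kubeflow/kubeflow-gateway",
  "ingressService": "istio-ingressgateway.istio-system.svc.cluster.local",
}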

Jose-Matsuda (Contributor) commented Jul 18, 2023

Monitoring System Investigation

Port-forwarding elastic-monitoring-es-nodes-0 to 9200 and using

curl -X GET "localhost:9200/_cluster/health?filter_path=status,*_shards&pretty" -u jose.matsuda
I get

{
  "status" : "yellow",
  "active_primary_shards" : 117,
  "active_shards" : 117,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0
}

I'm trying to use the cURL described here, but I don't seem to have access.
Using this one from the bottom instead:

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -u jose.matsuda

provides

{
  "note" : "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
  "index" : ".ds-heartbeat-8.7.1-2023.06.09-000002",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2023-06-09T13:04:03.984Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "Elasticsearch isn't allowed to allocate this shard to any of the nodes in the cluster. Choose a node to which you expect this shard to be allocated, find this node in the node-by-node explanation, and address the reasons which prevent Elasticsearch from allocating this shard there.",
  "node_allocation_decisions" : [
    {
      "node_id" : "F...Q",
      "node_name" : "elastic-monitoring-es-nodes-0",
      "transport_address" : "1..:9300",
      "node_attributes" : {
        "xpack.installed" : "true",
        "k8s_node_name" : "aks-general-...3r"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[.ds-heartbeat-8.7.1-2023.06.09-000002][0], node[F...pQ], [P], s[STARTED], a[id=O...g], failed_attempts[0]]"
        }
      ]
    }
  ]
}

If I look at the date of "unassigned_info" : { "reason" : "INDEX_CREATED", "at" : "2023-06-09T13:04:03.984Z", "last_allocation_status" : "no_attempt" }, it certainly seems like it's been out of commission for a while now.

According to Pat:

If you check that specific index it probably has the default setting of 1 replica (+1 main shard).
Since dev only has a single node, it's not possible to keep a replica of the data (one copy per node).
The solution is to update the index to tell it you don't want extra replicas, which is what was done by default for other indexes, but that one must be newer.

Trying to edit what I got from http://localhost:9200/.ds-heartbeat-8.7.1-2023.06.09-000002?pretty by port-forwarding Kibana as well.
(screenshot)
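
For reference, a per-index version of that change would look something like the sketch below (just the standard index settings API; whether to go index-by-index or cluster-wide is discussed in the next comment):

PUT /.ds-heartbeat-8.7.1-2023.06.09-000002/_settings
{
  "index.number_of_replicas": 0
}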

Jose-Matsuda (Contributor) commented Jul 18, 2023

More talking; found this in the docs about reducing replicas.

We might not need to specify .dsXYZ, as in just do

PUT _settings
{
  "index.number_of_replicas": 1
}

After that, though, we may need to go and delete the extra shards (in our case this .ds stuff; there should be 4 of them).
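
To see exactly which shards are unassigned, something like this should do it (a sketch using the _cat/shards API; the column list is just what seemed useful here):

curl -X GET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason&v=true" -u jose.matsuda | grep UNASSIGNED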

The other thing is we're unsure if we want to set it to 1 or 0 (is a replica an extra copy of the original?).
Given the information here, we have:

There are two types of shards: primaries and replicas. Each document in an index belongs to one primary shard. A replica shard 
is a copy of a primary shard. Replicas provide redundant copies of your data to protect against hardware failure and increase 
capacity to serve read requests like searching or retrieving a document

and with what Pat said before, I feel like we can just go down to zero. The shard data provided by curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -u jose.matsuda has "primary" : false. There is also this, which seems to be more specific to a certain index.

Steps:

  • Set replicas to zero (we're on dev so we can't have a replica of the data anyway)
    • Submitted the request (via the Kibana dev tools since curl wasn't being nice with me) and got "acknowledged": true
  • Manually delete the extra shards that are not primary; I think there should be 4 (because there are 4 unassigned shards)
    • This wasn't needed

ArgoCD is now all green, and there are no more unassigned_shards either; attempting the previous query of curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -u jose.matsuda yields an error saying that there are no unassigned shards.

jose@w-matsujo-1:~$ curl -X GET "localhost:9200/_cluster/health?filter_path=status,*_shards&pretty" -u jose.matsuda
Enter host password for user 'jose.matsuda':
{
  "status" : "green",
  "active_primary_shards" : 117,
  "active_shards" : 117,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0
}

Jose-Matsuda (Contributor) commented Jul 19, 2023

Regarding spark-operator and its yellow status, we are thinking of just removing it (at least when installed using kubeflow) outright.

This is because Kubeflow removed it from their stack and it will not exist in future releases.

We can instead get a standalone install like what has been done for cloud main.

I will close this ticket once we have had a team discussion to confirm that it is OK / create a new issue for Spark.

Just had to delete the application from ArgoCD as well, and it's gone.
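
For anyone repeating this, deleting the Application can be done with something along these lines (the app name is an assumption about what it's called in our ArgoCD):

# remove the ArgoCD Application (and, by default, the resources it manages)
argocd app delete spark-operator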
