KF1.7 Applications not Syncing #1762


Closed
wg102 opened this issue Jul 10, 2023 · 5 comments · Fixed by StatCan/aaw-kubeflow-manifests#373

wg102 (Contributor) commented Jul 10, 2023

KServe stuck in "Progressing" -- solved

Original ticket where the issue was found: #1752
Getting ImagePullBackOff, probably because Gatekeeper is blocking the image 'kserve/kserve-controller:v0.10.0'... that said, the error does say "Back-off restarting failed container".
Getting an "unable to parse ingress config json" error from the manager container (full log below).
We are currently using kserve/kserve-controller:v0.8.0 in prod, hence I don't think Gatekeeper is the reason it doesn't work.

Looking at the pod logs, we see:

{"level":"error","ts":1688561752.6363223,"logger":"entrypoint","msg":"unable to get ingress config.","error":"unable to parse ingress config json: invalid character '}' looking for beginning of object key string","stacktrace":"runtime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

According to an article on Stack Overflow, this is most likely caused by a malformed JSON file.
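
One way to confirm is to pull the raw ingress config out of the cluster and run it through a JSON parser. A minimal sketch, assuming KServe keeps this under the ingress key of its inferenceservice-config ConfigMap (the kubeflow namespace here is an assumption; it may be kserve depending on the install):

# Dump the ingress entry from the KServe config and let jq report any syntax error
kubectl -n kubeflow get configmap inferenceservice-config \
  -o jsonpath='{.data.ingress}' | jq .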

The other application is:

Monitoring-system stuck at Degraded Health

The monitoring-system one became degraded. It's the Elasticsearch resource in monitoring-system, and it's degraded because the health/status is yellow. Have we encountered this before, and how did we fix it? (I can't seem to find any issues in our repos that mention it aside from our new one.)

Jose-Matsuda (Contributor) commented Jul 10, 2023

With monitoring-system stuck at degraded health, if you click on the degraded app it says "ElasticSearch Cluster status is yellow, check the status of indices, replicas and shards". So while yes, it's a problem, I'm not sure it's 100% related to the upgrade.

I can't remember if it was degraded before / when we first upgraded it; I didn't think it was, but I may be mistaken.

Maybe this is a case where it kind of dies during some cluster operation.

wg102 (Contributor, Author) commented Jul 11, 2023

KServe stuck at Progressing:
(screenshot)

The issue was indeed a malformed file: the ingress config had one comma too many. Once that was edited in the live manifest, the application is out of sync (since there was a manual edit) but Healthy.
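
For reference, a trailing comma like the one in the hypothetical snippet below (the key names are illustrative, not our actual manifest) is exactly what produces the "invalid character '}' looking for beginning of object key string" error from Go's JSON parser:

{
  "ingressGateway": "kubeflow/kubeflow-gateway",
  "ingressService": "istio-ingressgateway.istio-system.svc.cluster.local",
}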

Jose-Matsuda (Contributor) commented Jul 18, 2023

Monitoring System Investigation

Port-forwarding elastic-monitoring-es-nodes-0 to 9200 and using

curl -X GET "localhost:9200/_cluster/health?filter_path=status,*_shards&pretty" -u jose.matsuda
I get

{
  "status" : "yellow",
  "active_primary_shards" : 117,
  "active_shards" : 117,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0
}

I'm trying to use the cURL described here, but I don't seem to have access.
Using this one from the bottom instead:

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -u jose.matsuda

provides

{
  "note" : "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
  "index" : ".ds-heartbeat-8.7.1-2023.06.09-000002",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2023-06-09T13:04:03.984Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "Elasticsearch isn't allowed to allocate this shard to any of the nodes in the cluster. Choose a node to which you expect this shard to be allocated, find this node in the node-by-node explanation, and address the reasons which prevent Elasticsearch from allocating this shard there.",
  "node_allocation_decisions" : [
    {
      "node_id" : "F...Q",
      "node_name" : "elastic-monitoring-es-nodes-0",
      "transport_address" : "1..:9300",
      "node_attributes" : {
        "xpack.installed" : "true",
        "k8s_node_name" : "aks-general-...3r"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[.ds-heartbeat-8.7.1-2023.06.09-000002][0], node[F...pQ], [P], s[STARTED], a[id=O...g], failed_attempts[0]]"
        }
      ]
    }
  ]
}

If I look at the date of "unassigned_info" : { "reason" : "INDEX_CREATED", "at" : "2023-06-09T13:04:03.984Z", "last_allocation_status" : "no_attempt" }, it certainly seems like it's been out of commission for a while now.

According to Pat:

If you check that specific index it probably has the default setting of 1 replica (+1 main shard).
Since dev only has a single node, it's not possible to keep a replica of the data (one copy per node).
The solution is to update the index to tell it you don't want extra replicas, which is what was done by default for other indexes, but that one must be newer.

Trying to edit what I got from http://localhost:9200/.ds-heartbeat-8.7.1-2023.06.09-000002?pretty by port-forwarding Kibana as well.
(screenshot)
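
For reference, a per-index version of that change would look something like the sketch below (just the standard index settings API; whether to go index-by-index or cluster-wide is discussed in the next comment):

PUT /.ds-heartbeat-8.7.1-2023.06.09-000002/_settings
{
  "index.number_of_replicas": 0
}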

Jose-Matsuda (Contributor) commented Jul 18, 2023

More talking; found this in the docs about reducing replicas.

We might not need to specify .dsXYZ, as in just do

PUT _settings
{
  "index.number_of_replicas": 1
}

After that, though, we may need to go and delete the extra shards (in our case this .ds stuff; there should be 4 of them).
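
To see exactly which shards are unassigned, something like this should do it (a sketch using the _cat/shards API; the column list is just what seemed useful here):

curl -X GET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason&v=true" -u jose.matsuda | grep UNASSIGNED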

The other thing is we're unsure if we want to set it to 1 or 0 (is a replica an extra copy of the original?).
Given the information here, we have:

There are two types of shards: primaries and replicas. Each document in an index belongs to one primary shard. A replica shard 
is a copy of a primary shard. Replicas provide redundant copies of your data to protect against hardware failure and increase 
capacity to serve read requests like searching or retrieving a document

and with what Pat said before, I feel like we can just go down to zero. The shard data provided by curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -u jose.matsuda has "primary" : false. There is also this, which seems to be more specific to a certain index.

Steps:

  • Set replicas to zero (we're on dev so we can't have a replica of the data anyway)
    • Submitted the request (via the Kibana dev tools since curl wasn't being nice with me) and got "acknowledged": true
  • Manually delete the extra shards that are not primary; I think there should be 4 (because there are 4 unassigned shards)
    • This wasn't needed

ArgoCD is now all green, and there are no more unassigned_shards either; attempting the previous query of curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -u jose.matsuda yields an error saying that there are no unassigned shards.

jose@w-matsujo-1:~$ curl -X GET "localhost:9200/_cluster/health?filter_path=status,*_shards&pretty" -u jose.matsuda
Enter host password for user 'jose.matsuda':
{
  "status" : "green",
  "active_primary_shards" : 117,
  "active_shards" : 117,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0
}

Jose-Matsuda (Contributor) commented Jul 19, 2023

Regarding spark-operator and its yellow status, we are thinking of just removing it (at least when installed using kubeflow) outright.

This is because Kubeflow removed it from their stack and it will not exist in future releases.

We can instead get a standalone install like what has been done for cloud main.

I will close this ticket once we have had a team discussion to confirm that it is OK / create a new issue for Spark.

Just had to delete the application from ArgoCD as well, and it's gone.
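
For anyone repeating this, deleting the Application can be done with something along these lines (the app name is an assumption about what it's called in our ArgoCD):

# remove the ArgoCD Application (and, by default, the resources it manages)
argocd app delete spark-operator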
