Ruler segfault #4459

Closed
adriankostrubiak-tomtom opened this issue Sep 2, 2021 · 4 comments

Describe the bug
Observed panics due to segmentation faults in the ruler.

To Reproduce
Steps to reproduce the behavior:
Run Cortex 1.10.0 and run the ruler (target: ruler)

Expected behavior
Ruler should not panic

Environment:

  • Infrastructure: kubernetes - AKS
  • Deployment tool: customized yaml manifests

Storage Engine

  • Blocks
  • Chunks

Additional Context

We are seeing consistent panics from the ruler, with errors like

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1e20df3]

goroutine 14595 [running]:
github.com/cortexproject/cortex/pkg/querier.querier.Select(0x2bdf130, 0xc0047b2820, 0xc0046fc440, 0x2, 0x2, 0x28908e0, 0x2bdeaa0, 0xc00206be60, 0x17ba6ec5d74, 0x17ba7234bf4, ...)
	/__w/cortex/cortex/pkg/querier/querier.go:323 +0x193
github.com/cortexproject/cortex/pkg/querier/lazyquery.LazyQuerier.Select.func1(0xc0020e13e0, 0x2be0da0, 0xc000157c00, 0xc004122900, 0x0, 0xc004122900, 0xa, 0x10)
	/__w/cortex/cortex/pkg/querier/lazyquery/lazyquery.go:52 +0x72
created by github.com/cortexproject/cortex/pkg/querier/lazyquery.LazyQuerier.Select
	/__w/cortex/cortex/pkg/querier/lazyquery/lazyquery.go:51 +0xad
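
For context, this class of panic (SIGSEGV at addr=0x0) is what the Go runtime emits when a nil struct pointer is dereferenced. A minimal standalone illustration of the same failure class (not the Cortex code):

package main

type hints struct {
	Start, End int64
}

// window reads fields from h without a nil check, mirroring the failure
// class in the trace above (illustration only, not Cortex code).
func window(h *hints) (int64, int64) {
	return h.Start, h.End
}

func main() {
	var h *hints // nil
	window(h) // panics: invalid memory address or nil pointer dereference [SIGSEGV]
}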

Below is the configuration diff from the defaults, as emitted by the ruler.
Note that I also tried with blocks_storage.bucket_store.index_header_lazy_loading_enabled: false and saw the same error.

alertmanager:
  enable_api: true
  external_url: https://alertmanager.cluster-monitor.*******.com/alertmanager
  sharding_enabled: true
  sharding_ring:
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-alertmanagers/
      store: etcd
alertmanager_storage:
  s3:
    access_key_id: ******
    bucket_name: cortex-alertmanager
    endpoint: s3.storage.svc.cluster.local:9000
    insecure: true
    secret_access_key: '********'
api:
  response_compression_enabled: true
blocks_storage:
  bucket_store:
    bucket_index:
      enabled: true
    chunks_cache:
      backend: memcached
      memcached:
        addresses: dnssrv+_memcached._tcp.chunks-cache.cluster-monitor-cortex.svc.cluster.local
    index_cache:
      backend: memcached
      memcached:
        addresses: dnssrv+_memcached._tcp.index-cache.cluster-monitor-cortex.svc.cluster.local
    index_header_lazy_loading_enabled: true
    metadata_cache:
      backend: memcached
      bucket_index_content_ttl: 2m0s
      memcached:
        addresses: dnssrv+_memcached._tcp.metadata-cache.cluster-monitor-cortex.svc.cluster.local
      metafile_doesnt_exist_ttl: 2m0s
      tenant_blocks_list_ttl: 2m0s
    sync_interval: 5m0s
  s3:
    access_key_id: *****
    bucket_name: cortex
    endpoint: s3.storage.svc.cluster.local:9000
    insecure: true
    secret_access_key: '********'
  tsdb:
    close_idle_tsdb_timeout: 15m0s
    dir: /var/cortex/tsdb
    max_exemplars: 1000
compactor:
  block_deletion_marks_migration_enabled: false
  cleanup_interval: 5m0s
distributor:
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-ha-tracker/
      store: etcd
  ring:
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-collectors/
      store: etcd
  shard_by_all_labels: true
frontend:
  grpc_client_config:
    grpc_compression: snappy
  log_queries_longer_than: 1s
  query_stats_enabled: true
frontend_worker:
  frontend_address: query-frontend.cluster-monitor-cortex.svc.cluster.local:9095
  grpc_client_config:
    grpc_compression: snappy
    max_send_msg_size: 33554432
ingester:
  lifecycler:
    availability_zone: westeurope-2
    observe_period: 3s
    ring:
      kvstore:
        etcd:
          endpoints:
          - client.etcd.svc.cluster.local:2379
        prefix: cortex-collectors/
        store: etcd
  walconfig:
    wal_enabled: true
ingester_client:
  grpc_client_config:
    grpc_compression: snappy
limits:
  accept_ha_samples: true
  ingestion_burst_size: 75000
  ingestion_rate: 55000
  max_series_per_metric: 70000
querier:
  at_modifier_enabled: true
  query_store_for_labels_enabled: true
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      memcached:
        expiration: 12h0m0s
      memcached_client:
        addresses: dnssrv+_memcached._tcp.index-cache.cluster-monitor-cortex.svc.cluster.local
  split_queries_by_interval: 24h0m0s
ruler:
  alertmanager_url: http://alertmanager.cluster-monitor-cortex.svc.cluster.local:3100/alertmanager
  enable_api: true
  enable_sharding: true
  external_url: https://alertmanager.cluster-monitor.******.com
  ring:
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-rulers/
      store: etcd
  ruler_client:
    grpc_compression: snappy
ruler_storage:
  s3:
    access_key_id: ********
    bucket_name: cortex-ruler
    endpoint: s3.storage.svc.cluster.local:9000
    insecure: true
    secret_access_key: '********'
server:
  http_listen_port: 3100
  log_level: debug
storage:
  engine: blocks
store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-collectors/
      store: etcd
    zone_awareness_enabled: true
target: ruler
alanprot (Member) commented Sep 14, 2021

We are experiencing the same problem.

Looking at the code, it seems that when we update a rule group manager for the first time after a restart, we try to restore the rules, which ends up at this line:

https://github.com/prometheus/prometheus/blob/ffe8c7e26884148207f5f0df06c0430e2d3b7bca/rules/manager.go#L749
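
As a rough standalone sketch of that call shape (the types below are simplified stand-ins, not the real Prometheus API), the point is that the restore path hands the querier a nil *SelectHints:

package main

import "fmt"

// Simplified stand-ins for storage.SelectHints / storage.Querier, only to
// show the call shape used by rule restoration.
type SelectHints struct{ Start, End int64 }

type Querier interface {
	Select(sortSeries bool, hints *SelectHints, matchers ...string)
}

type loggingQuerier struct{}

func (loggingQuerier) Select(_ bool, hints *SelectHints, matchers ...string) {
	// hints prints as <nil> when the caller passes no hints.
	fmt.Printf("hints=%v matchers=%v\n", hints, matchers)
}

func main() {
	var q Querier = loggingQuerier{}
	// Mirrors the restore call above: the hints argument is nil.
	q.Select(false, nil, `__name__="ALERTS_FOR_STATE"`)
}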

Now, looking at the Cortex code, we can see:

if (sp == nil || sp.Func == "series") && !q.queryStoreForLabels {
	// In this case, the query time range has already been validated when the querier has been
	// created.
	return q.metadataQuerier.Select(true, sp, matchers...)
}

userID, err := tenant.TenantID(ctx)
if err != nil {
	return storage.ErrSeriesSet(err)
}

// Validate query time range. Even if the time range has already been validated when we created
// the querier, we need to check it again here because the time range specified in hints may be
// different.
startMs, endMs, err := validateQueryTimeRange(ctx, userID, sp.Start, sp.End, q.limits, q.maxQueryIntoFuture)

and especially this line:

if (sp == nil || sp.Func == "series") && !q.queryStoreForLabels {

Here we can see that if sp is nil (which is the case here, because of the restore path linked above) and query_store_for_labels_enabled: true is set (which is also the case), we will get a nil pointer dereference on this line:

startMs, endMs, err := validateQueryTimeRange(ctx, userID, sp.Start, sp.End, q.limits, q.maxQueryIntoFuture)

The question now is: what should the behaviour be if the hints (sp) are nil in this case?

This behaviour was introduced by #3461.
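
Purely as a sketch of one possible answer to that question (standalone stand-in types; mint/maxt are hypothetical fields, not necessarily the real Cortex ones): when the hints are nil, the querier could fall back to the time range it was created with instead of dereferencing sp.

package main

import "fmt"

// SelectHints stands in for the Prometheus hints type used above.
type SelectHints struct {
	Start, End int64
	Func       string
}

// querier stands in for the Cortex querier; mint/maxt are hypothetical
// fields holding the range the querier was created with.
type querier struct {
	mint, maxt          int64
	queryStoreForLabels bool
}

// queryRange mirrors the branch quoted above, plus a nil-hints fallback so
// sp.Start / sp.End are never read through a nil pointer.
func (q querier) queryRange(sp *SelectHints) (int64, int64) {
	if (sp == nil || sp.Func == "series") && !q.queryStoreForLabels {
		return q.mint, q.maxt // metadata path: sp is never dereferenced
	}
	if sp == nil {
		return q.mint, q.maxt // guard: without this, sp.Start panics below
	}
	return sp.Start, sp.End
}

func main() {
	q := querier{mint: 1000, maxt: 2000, queryStoreForLabels: true}
	// With query_store_for_labels_enabled=true and nil hints, the original
	// code reached the sp.Start dereference; with the guard it falls back.
	fmt.Println(q.queryRange(nil)) // 1000 2000
}

The metadata branch already avoids touching sp when it is nil; the extra guard just extends the same idea to the query_store_for_labels_enabled=true path.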

alvinlin123 (Contributor) commented Sep 17, 2021

The code @alanprot mentioned might have been fixed as part of #4413

alanprot (Member) commented:

It seems that indeed this is not the case anymore!

pracucci (Contributor) commented:

Fixed by #4413.

Feel free to reopen it if that's not the case.
