Ruler segfault #4459

Closed
adriankostrubiak-tomtom opened this issue Sep 2, 2021 · 4 comments

Describe the bug
Observed panics due to segmentation faults in the ruler.

To Reproduce
Steps to reproduce the behavior:
Run Cortex 1.10.0 and run the ruler (target: ruler)

Expected behavior
Ruler should not panic

Environment:

  • Infrastructure: kubernetes - AKS
  • Deployment tool: customized yaml manifests

Storage Engine

  • Blocks
  • Chunks

Additional Context

We are seeing consistent panics from the ruler, with errors like

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1e20df3]

goroutine 14595 [running]:
github.com/cortexproject/cortex/pkg/querier.querier.Select(0x2bdf130, 0xc0047b2820, 0xc0046fc440, 0x2, 0x2, 0x28908e0, 0x2bdeaa0, 0xc00206be60, 0x17ba6ec5d74, 0x17ba7234bf4, ...)
	/__w/cortex/cortex/pkg/querier/querier.go:323 +0x193
github.com/cortexproject/cortex/pkg/querier/lazyquery.LazyQuerier.Select.func1(0xc0020e13e0, 0x2be0da0, 0xc000157c00, 0xc004122900, 0x0, 0xc004122900, 0xa, 0x10)
	/__w/cortex/cortex/pkg/querier/lazyquery/lazyquery.go:52 +0x72
created by github.com/cortexproject/cortex/pkg/querier/lazyquery.LazyQuerier.Select
	/__w/cortex/cortex/pkg/querier/lazyquery/lazyquery.go:51 +0xad
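
For context, this class of panic (SIGSEGV at addr=0x0) is what the Go runtime emits when a nil struct pointer is dereferenced. A minimal standalone illustration of the same failure class (not the Cortex code):

package main

type hints struct {
	Start, End int64
}

// window reads fields from h without a nil check, mirroring the failure
// class in the trace above (illustration only, not Cortex code).
func window(h *hints) (int64, int64) {
	return h.Start, h.End
}

func main() {
	var h *hints // nil
	window(h) // panics: invalid memory address or nil pointer dereference [SIGSEGV]
}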

Below is the configuration diff from the defaults, as emitted by the ruler.
Note that I also tried with blocks_storage.bucket_store.index_header_lazy_loading_enabled: false and saw the same error.

alertmanager:
  enable_api: true
  external_url: https://alertmanager.cluster-monitor.*******.com/alertmanager
  sharding_enabled: true
  sharding_ring:
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-alertmanagers/
      store: etcd
alertmanager_storage:
  s3:
    access_key_id: ******
    bucket_name: cortex-alertmanager
    endpoint: s3.storage.svc.cluster.local:9000
    insecure: true
    secret_access_key: '********'
api:
  response_compression_enabled: true
blocks_storage:
  bucket_store:
    bucket_index:
      enabled: true
    chunks_cache:
      backend: memcached
      memcached:
        addresses: dnssrv+_memcached._tcp.chunks-cache.cluster-monitor-cortex.svc.cluster.local
    index_cache:
      backend: memcached
      memcached:
        addresses: dnssrv+_memcached._tcp.index-cache.cluster-monitor-cortex.svc.cluster.local
    index_header_lazy_loading_enabled: true
    metadata_cache:
      backend: memcached
      bucket_index_content_ttl: 2m0s
      memcached:
        addresses: dnssrv+_memcached._tcp.metadata-cache.cluster-monitor-cortex.svc.cluster.local
      metafile_doesnt_exist_ttl: 2m0s
      tenant_blocks_list_ttl: 2m0s
    sync_interval: 5m0s
  s3:
    access_key_id: *****
    bucket_name: cortex
    endpoint: s3.storage.svc.cluster.local:9000
    insecure: true
    secret_access_key: '********'
  tsdb:
    close_idle_tsdb_timeout: 15m0s
    dir: /var/cortex/tsdb
    max_exemplars: 1000
compactor:
  block_deletion_marks_migration_enabled: false
  cleanup_interval: 5m0s
distributor:
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-ha-tracker/
      store: etcd
  ring:
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-collectors/
      store: etcd
  shard_by_all_labels: true
frontend:
  grpc_client_config:
    grpc_compression: snappy
  log_queries_longer_than: 1s
  query_stats_enabled: true
frontend_worker:
  frontend_address: query-frontend.cluster-monitor-cortex.svc.cluster.local:9095
  grpc_client_config:
    grpc_compression: snappy
    max_send_msg_size: 33554432
ingester:
  lifecycler:
    availability_zone: westeurope-2
    observe_period: 3s
    ring:
      kvstore:
        etcd:
          endpoints:
          - client.etcd.svc.cluster.local:2379
        prefix: cortex-collectors/
        store: etcd
  walconfig:
    wal_enabled: true
ingester_client:
  grpc_client_config:
    grpc_compression: snappy
limits:
  accept_ha_samples: true
  ingestion_burst_size: 75000
  ingestion_rate: 55000
  max_series_per_metric: 70000
querier:
  at_modifier_enabled: true
  query_store_for_labels_enabled: true
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      memcached:
        expiration: 12h0m0s
      memcached_client:
        addresses: dnssrv+_memcached._tcp.index-cache.cluster-monitor-cortex.svc.cluster.local
  split_queries_by_interval: 24h0m0s
ruler:
  alertmanager_url: http://alertmanager.cluster-monitor-cortex.svc.cluster.local:3100/alertmanager
  enable_api: true
  enable_sharding: true
  external_url: https://alertmanager.cluster-monitor.******.com
  ring:
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-rulers/
      store: etcd
  ruler_client:
    grpc_compression: snappy
ruler_storage:
  s3:
    access_key_id: ********
    bucket_name: cortex-ruler
    endpoint: s3.storage.svc.cluster.local:9000
    insecure: true
    secret_access_key: '********'
server:
  http_listen_port: 3100
  log_level: debug
storage:
  engine: blocks
store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      etcd:
        endpoints:
        - client.etcd.svc.cluster.local:2379
      prefix: cortex-collectors/
      store: etcd
    zone_awareness_enabled: true
target: ruler
alanprot (Member) commented Sep 14, 2021

We are experiencing the same problem.

Looking at the code, it seems that when we update a rule group manager for the first time after a restart, we try to restore the rules, which ends up at this line:

https://github.com/prometheus/prometheus/blob/ffe8c7e26884148207f5f0df06c0430e2d3b7bca/rules/manager.go#L749
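
As a rough standalone sketch of that call shape (the types below are simplified stand-ins, not the real Prometheus API), the point is that the restore path hands the querier a nil *SelectHints:

package main

import "fmt"

// Simplified stand-ins for storage.SelectHints / storage.Querier, only to
// show the call shape used by rule restoration.
type SelectHints struct{ Start, End int64 }

type Querier interface {
	Select(sortSeries bool, hints *SelectHints, matchers ...string)
}

type loggingQuerier struct{}

func (loggingQuerier) Select(_ bool, hints *SelectHints, matchers ...string) {
	// hints prints as <nil> when the caller passes no hints.
	fmt.Printf("hints=%v matchers=%v\n", hints, matchers)
}

func main() {
	var q Querier = loggingQuerier{}
	// Mirrors the restore call above: the hints argument is nil.
	q.Select(false, nil, `__name__="ALERTS_FOR_STATE"`)
}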

Now, looking at the Cortex code, we can see:

if (sp == nil || sp.Func == "series") && !q.queryStoreForLabels {
	// In this case, the query time range has already been validated when the querier has been
	// created.
	return q.metadataQuerier.Select(true, sp, matchers...)
}

userID, err := tenant.TenantID(ctx)
if err != nil {
	return storage.ErrSeriesSet(err)
}

// Validate query time range. Even if the time range has already been validated when we created
// the querier, we need to check it again here because the time range specified in hints may be
// different.
startMs, endMs, err := validateQueryTimeRange(ctx, userID, sp.Start, sp.End, q.limits, q.maxQueryIntoFuture)

and especially this line:

if (sp == nil || sp.Func == "series") && !q.queryStoreForLabels {

Here we can see that if sp is nil (which is the case here, because of the restore path linked above) and query_store_for_labels_enabled: true is set (which is also the case), we will get a nil pointer dereference on this line:

startMs, endMs, err := validateQueryTimeRange(ctx, userID, sp.Start, sp.End, q.limits, q.maxQueryIntoFuture)

The question now is: what should the behaviour be if the hints (sp) are nil in this case?

This behaviour was introduced by #3461.
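
Purely as a sketch of one possible answer to that question (standalone stand-in types; mint/maxt are hypothetical fields, not necessarily the real Cortex ones): when the hints are nil, the querier could fall back to the time range it was created with instead of dereferencing sp.

package main

import "fmt"

// SelectHints stands in for the Prometheus hints type used above.
type SelectHints struct {
	Start, End int64
	Func       string
}

// querier stands in for the Cortex querier; mint/maxt are hypothetical
// fields holding the range the querier was created with.
type querier struct {
	mint, maxt          int64
	queryStoreForLabels bool
}

// queryRange mirrors the branch quoted above, plus a nil-hints fallback so
// sp.Start / sp.End are never read through a nil pointer.
func (q querier) queryRange(sp *SelectHints) (int64, int64) {
	if (sp == nil || sp.Func == "series") && !q.queryStoreForLabels {
		return q.mint, q.maxt // metadata path: sp is never dereferenced
	}
	if sp == nil {
		return q.mint, q.maxt // guard: without this, sp.Start panics below
	}
	return sp.Start, sp.End
}

func main() {
	q := querier{mint: 1000, maxt: 2000, queryStoreForLabels: true}
	// With query_store_for_labels_enabled=true and nil hints, the original
	// code reached the sp.Start dereference; with the guard it falls back.
	fmt.Println(q.queryRange(nil)) // 1000 2000
}

The metadata branch already avoids touching sp when it is nil; the extra guard just extends the same idea to the query_store_for_labels_enabled=true path.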

alvinlin123 (Contributor) commented Sep 17, 2021

The code @alanprot mentioned might have been fixed as part of #4413

alanprot (Member) commented:

It seems that indeed this is not the case anymore!

pracucci (Contributor) commented:

Fixed by #4413.

Feel free to reopen it if that's not the case.
