Open
Labels
bug (Something isn't working)
Description
Is there an existing issue for this?
- I have searched the existing issues
YACE version
0.62.1 image from https://hub.docker.com/r/prometheuscommunity/yet-another-cloudwatch-exporter/tags
Config file
Note: this config was last changed about a year ago, and was working fine with Yace 0.61.2
apiVersion: v1alpha1
discovery:
  exportedTagsOnMetrics:
    AWS/ApplicationELB:
      - Name
      - Environment
      - Family
    AWS/ElastiCache:
      - Name
      - Environment
      - Family
    AWS/NetworkELB:
      - Name
      - Environment
      - Family
    AWS/RDS:
      - Name
      - Environment
      - Family
    AWS/Redshift:
      - Name
      - Environment
      - Family
    AWS/Kafka:
      - Name
      - Environment
      - Family
  jobs:
    - type: AWS/RDS
      regions: [${AWS_REGION}]
      searchTags:
        - key: Environment
          value: ^(${ENVIRONMENT})$
        - key: Family
          value: ^(${FAMILY_NAME})$
      statistics: [Average]
      period: 60
      length: 120
      delay: 300
      metrics:
        - name: BurstBalance
        - name: CPUCreditBalance
        - name: CPUUtilization
        - name: DatabaseConnections
        - name: DiskQueueDepth
        # Different length of query for EBS Byte Balance %
        # For some reason, the default length returns 0
        # sometimes.
        - name: EBSByteBalance%
          period: 60
          length: 300
        - name: FreeableMemory
        - name: FreeStorageSpace
        - name: MaximumUsedTransactionIDs
        - name: NetworkReceiveThroughput
        - name: NetworkTransmitThroughput
        - name: ReadIOPS
        - name: ReadLatency
        - name: ReadThroughput
        - name: SwapUsage
        - name: WriteIOPS
        - name: WriteLatency
        - name: WriteThroughput
    - type: AWS/Redshift
      regions: [${AWS_REGION}]
      searchTags:
        - key: Environment
          value: ^(${ENVIRONMENT})$
        - key: Family
          value: ^(${FAMILY_NAME})$
      statistics: [Average]
      period: 60
      length: 120
      delay: 300
      metrics:
        - name: ReadIOPS
        - name: WriteIOPS
        # The average number of bytes read from disk per second.
        - name: ReadThroughput
        # The average number of bytes written to disk per second.
        - name: WriteThroughput
        - name: ReadLatency
        - name: WriteLatency
        - name: NetworkReceiveThroughput
        - name: NetworkTransmitThroughput
        - name: RedshiftManagedStorageTotalCapacity
        - name: TotalTableCount
        - name: DatabaseConnections
        - name: HealthStatus
        # The percent of disk space used.
        - name: PercentageDiskSpaceUsed
        # The disk or storage space used by a schema.
        - name: StorageUsed
        - name: AutoVacuumSpaceFreed
        - name: CPUUtilization
        - name: CommitQueueLength
        - name: MaintenanceMode
        # The average number of queries completed per second.
        - name: QueriesCompletedPerSecond
        # The average amount of time to complete a query.
        - name: QueryDuration
        # The number of queries waiting to enter a workload management (WLM) queue.
        - name: WLMQueueLength
        # The total time queries spent waiting in the workload management (WLM) queue.
        - name: WLMQueueWaitTime
        # The average number of queries completed per second for a workload management (WLM) queue.
        - name: WLMQueriesCompletedPerSecond
        # The average length of time to complete a query for a workload management (WLM) queue.
        - name: WLMQueryDuration
        # The number of queries running from both the main cluster and concurrency scaling cluster per WLM queue.
        - name: WLMRunningQueries
    - type: AWS/ApplicationELB
      regions: [${AWS_REGION}]
      searchTags:
        - key: Environment
          # We have uat specific alb in staging account
          value: (${ENVIRONMENT}%{if ENVIRONMENT == "staging"}|uat%{endif})
        - key: Family
          value: ^(${FAMILY_NAME})$
      statistics: [Sum]
      period: 60
      length: 120
      delay: 300
      metrics:
        - name: TargetResponseTime
          nilToZero: true
          statistics: [Average]
        - name: RequestCount
          nilToZero: true
        - name: HTTPCode_Target_5XX_Count
          nilToZero: true
        - name: HTTPCode_Target_4XX_Count
          nilToZero: true
        - name: HTTPCode_Target_3XX_Count
          nilToZero: true
        - name: HTTPCode_Target_2XX_Count
          nilToZero: true
        - name: ActiveConnectionCount
          nilToZero: true
        - name: NewConnectionCount
          nilToZero: true
        - name: ProcessedBytes
          nilToZero: true
    - type: AWS/NetworkELB
      regions: [${AWS_REGION}]
      searchTags:
        - key: Environment
          # We have uat specific nlb in staging account
          value: (${ENVIRONMENT}%{if ENVIRONMENT == "staging"}|uat%{endif})
        - key: Family
          value: ^(${FAMILY_NAME})$
      statistics: [Sum]
      period: 60
      length: 120
      delay: 300
      metrics:
        - name: ActiveFlowCount
          nilToZero: true
        - name: ActiveFlowCount_TCP
          nilToZero: true
        - name: ActiveFlowCount_TLS
          nilToZero: true
        - name: ClientTLSNegotiationErrorCount
          nilToZero: true
        - name: NewFlowCount
          nilToZero: true
        - name: NewFlowCount_TCP
          nilToZero: true
        - name: NewFlowCount_TLS
          nilToZero: true
        - name: ProcessedBytes
          nilToZero: true
        - name: ProcessedBytes_TCP
          nilToZero: true
        - name: ProcessedBytes_TLS
          nilToZero: true
        - name: ProcessedPackets
          nilToZero: true
        - name: TCP_Client_Reset_count
          nilToZero: true
        - name: TCP_ELB_Reset_Count
          nilToZero: true
        - name: TCP_Target_Reset_Count
          nilToZero: true
    - type: AWS/Kafka
      regions: [${AWS_REGION}]
      searchTags:
        - key: Environment
          # We have uat specific nlb in staging account
          value: (${ENVIRONMENT}%{if ENVIRONMENT == "staging"}|uat%{endif})
        - key: Family
          value: ^(${FAMILY_NAME})$
      statistics: [Sum]
      period: 60
      length: 120
      delay: 300
      # Doc: https://docs.aws.amazon.com/msk/latest/developerguide/metrics-details.html
      metrics:
        # The percentage of the root disk used by the broker.
        - name: RootDiskUsed
          nilToZero: true
        # The percentage of disk space used for data logs.
        - name: KafkaDataLogsDiskUsed
          nilToZero: true
        # The number of active authenticated, unauthenticated, and inter-broker connections.
        - name: ConnectionCount
          nilToZero: true
        # This metric can help you monitor CPU credit balance on the brokers.
        - name: CPUCreditBalance
          nilToZero: true
        # Total number of topics across all brokers in the cluster.
        - name: GlobalTopicCount
          nilToZero: true
        # The total number of topic partitions per broker, including replicas.
        - name: PartitionCount
          nilToZero: true
        # The number of under-replicated partitions for the broker.
        - name: UnderReplicatedPartitions
          nilToZero: true
        # Total number of partitions that are offline in the cluster.
        - name: OfflinePartitionsCount
          nilToZero: true
        # The number of incoming messages per second for the broker.
        - name: MessagesInPerSec
          nilToZero: true
        # The average time in milliseconds spent in broker network and I/O threads to process requests.
        - name: RequestTime
          nilToZero: true
        # For Producers
        # The mean produce time in milliseconds.
        - name: ProduceTotalTimeMsMean
          nilToZero: true
        # For consumers
        # The aggregated offset lag for all the partitions in a topic.
        - name: SumOffsetLag
          nilToZero: true
        # For brokers
        # https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html#bestpractices-monitor-memory
        - name: HeapMemoryAfterGC
          nilToZero: true
        # https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/#metric-to-watch-page-cache-read-ratio
        - name: MemoryCached
          nilToZero: true
        # The size in bytes of memory that is free and available for the broker, if MemoryCached is high and MemoryFree is low, then the broker is using memory effectively.
        - name: MemoryFree
          nilToZero: true
        # The size in bytes of memory that is in use for the broker.
        - name: MemoryUsed
          nilToZero: true
        # The size in bytes of swap memory that is in use for the broker.
        - name: SwapUsed
          nilToZero: true
        # The In-Sync Replication (ISR) count indicates the set of replicas up-to-date with the leader. The expected value for UnderMinIsrPartitionCount is zero.
        - name: UnderMinIsrPartitionCount
          nilToZero: true
        # The percentage of CPU in user space used by the broker
        - name: CpuUser
          nilToZero: true
        # The number of bytes per second received from clients. This metric is available per broker and also per topic.
        - name: BytesInPerSec
          nilToZero: true
        # The number of bytes per second sent to clients. This metric is available per broker and also per topic.
        - name: BytesOutPerSec
          nilToZero: true
        # indicates the number of packets shaped (dropped or queued) due to exceeding network allocations.
        - name: TrafficShaping
          nilToZero: true
        ## Advanced (paid) metrics
        # The number of messages in the throttle queue.
        - name: ProduceThrottleQueueSize
          nilToZero: true
        - name: RequestThrottleQueueSize
          nilToZero: true
        # The number of read and write operations in a specified time period
        - name: VolumeReadOps
          nilToZero: true
        - name: VolumeWriteOps
          nilToZero: true
    - type: AWS/ElastiCache
      regions: [${AWS_REGION}]
      searchTags:
        - key: Environment
          # We have uat specific elasticache in staging account
          value: (${ENVIRONMENT}%{if ENVIRONMENT == "staging"}|uat%{endif})
        - key: Family
          value: ^(${FAMILY_NAME})$
      statistics: [Average]
      period: 60
      length: 120
      delay: 300
      metrics:
        - name: CacheHitRate
          nilToZero: true
        - name: CacheHits
        - name: CacheMisses
        - name: CPUCreditBalance
        - name: CPUUtilization
        - name: CurrConnections
        # Total number of keys in all databases
        - name: CurrItems
        # Total number of keys in all databases that have a ttl set
        - name: CurrVolatileItems
        # The number of keys that have been evicted due to the maxmemory limit
        - name: Evictions
        # Indicates whether the node is the primary node of current shard/cluster. 1 = primary, 0 = not primary
        - name: IsMaster
          # Sometimes, Cloudwatch doesn't return any value. We can't get Yace to turn this into a zero,
          # otherwise our alerting could mistakenly consider it as a change of master node (failover).
          nilToZero: false
        - name: DatabaseMemoryUsagePercentage
        - name: EngineCPUUtilization
        - name: Evictions
        - name: NetworkBytesIn
        - name: NetworkBytesOut
        # Number of packets queued or dropped because the outbound bandwidth exceeded the maximum for the instance
        - name: NetworkBandwidthOutAllowanceExceeded
        # Number of packets dropped because connection tracking exceeded the maximum for the instance
        - name: NetworkConntrackAllowanceExceeded
        # Number of packets queued or dropped because the bidirectional packets/s exceeded the maximum for the instance
        - name: NetworkPacketsPerSecondAllowanceExceeded
        # The total number of connections that have been accepted by the server during this period
        - name: NewConnections
        # How far behind the replica is in applying changes from the primary node
        - name: ReplicationLag
        # Redis commands metrics
        - name: GetTypeCmds
          nilToZero: true
        - name: SetTypeCmds
          nilToZero: true
        - name: HashBasedCmds
          nilToZero: true
        - name: KeyBasedCmds
          nilToZero: true
        - name: NonKeyTypeCmds
          nilToZero: true
        - name: SetBasedCmds
          nilToZero: true
        - name: SortedSetBasedCmds
          nilToZero: true
        - name: StringBasedCmds
          nilToZero: true
        - name: JsonBasedGetCmds
          nilToZero: true
        - name: JsonBasedSetCmds
          nilToZero: true
        - name: ListBasedCmds
          nilToZero: true
        - name: PubSubBasedCmds
          nilToZero: true
        - name: EvalBasedCmds
          nilToZero: true
        # Latency metrics per command type
        - name: GetTypeCmdsLatency
          nilToZero: true
        - name: SetTypeCmdsLatency
          nilToZero: true
        - name: HashBasedCmdsLatency
          nilToZero: true
        - name: KeyBasedCmdsLatency
          nilToZero: true
        - name: NonKeyTypeCmdsLatency
          nilToZero: true
        - name: SetBasedCmdsLatency
          nilToZero: true
        - name: SortedSetBasedCmdsLatency
          nilToZero: true
        - name: StringBasedCmdsLatency
          nilToZero: true
        - name: JsonBasedCmdsLatency
          nilToZero: true
        - name: JsonBasedGetCmdsLatency
          nilToZero: true
        - name: JsonBasedSetCmdsLatency
          nilToZero: true
        - name: ListBasedCmdsLatency
          nilToZero: true
        - name: PubSubBasedCmdsLatency
          nilToZero: true
        - name: EvalBasedCmdsLatency
          nilToZero: true
static:
  %{~ if length(EB_INFO) > 0 ~}
  %{~ for eb in EB_INFO ~}
  - namespace: AWS/Events
    name: aws_eventbridge
    regions:
      - ${AWS_REGION}
    dimensions:
      - name: EventBusName
        value: ${eb.event_bus_name}
      - name: RuleName
        value: ${eb.rule_name}
    customTags:
      - key: Environment
        value: ${ENVIRONMENT}
      - key: Family
        value: ${FAMILY_NAME}
    metrics:
      - name: DeadLetterInvocations
        statistics: [Sum]
        period: 60
        length: 120
        delay: 300
        nilToZero: true
      - name: Events
        statistics: [Sum]
        period: 60
        length: 120
        delay: 300
        nilToZero: true
      - name: FailedInvocations
        statistics: [Sum]
        period: 60
        length: 120
        delay: 300
        nilToZero: true
      - name: IngestionToInvocationStartLatency
        statistics: [p50, p90, p99]
        period: 60
        length: 120
        delay: 300
        nilToZero: true
      - name: Invocations
        statistics: [Sum]
        period: 60
        length: 120
        delay: 300
        nilToZero: true
      - name: InvocationsFailedToBeSentToDlq
        statistics: [Sum]
        period: 60
        length: 120
        delay: 300
        nilToZero: true
      - name: InvocationsSentToDlq
        statistics: [Sum]
        period: 60
        length: 120
        delay: 300
        nilToZero: true
      - name: MatchedEvents
        statistics: [Sum]
        period: 60
        length: 120
        delay: 300
        nilToZero: true
      - name: ThrottledRules
        statistics: [Sum]
        period: 60
        length: 120
        delay: 300
        nilToZero: true
      - name: TriggeredRules
        statistics: [Sum]
        period: 60
        length: 120
        delay: 300
        nilToZero: true
  %{~ endfor ~}
  %{~ endif ~}
  %{~ if LOG_GROUP_NAME != "" ~}
  - namespace: AWS/Logs
    name: aws_cloudwatch_no_logs
    regions:
      - ${AWS_REGION}
    dimensions:
      - name: LogGroupName
        value: ${LOG_GROUP_NAME}
    customTags:
      - key: Environment
        value: ${ENVIRONMENT}
      - key: Family
        value: ${FAMILY_NAME}
    metrics:
      - name: IncomingLogEvents
        statistics:
          - Sum
        period: 60
        length: 120
  %{~ endif ~}
Current Behavior
After updating from 0.61.2 to 0.62.1, Yace crashes immediately after starting with the following output:
{"time":"2025-06-19T02:27:48.971833727Z","level":"INFO","source":"main.go:344","msg":"Yace startup completed","version":"custom-build","version":"custom-build","feature_flags":""}
{"time":"2025-06-19T02:27:51.47698747Z","level":"ERROR","source":"discovery.go:52","msg":"No tagged resources made it through filtering","version":"custom-build","job_type":"AWS/Redshift","region":"ap-southeast-2","arn":"","account":"180570210447","err":"expected to discover resources but none were found"}
panic: runtime error: index out of range [0] with length 0
goroutine 436 [running]:
github.com/prometheus-community/yet-another-cloudwatch-exporter/pkg/clients/cloudwatch/v1.createGetMetricStatisticsInput({0xc000b6da40, 0x2, 0x2}, 0xc000781f70, 0xc00063e960, 0xc0008063d0)
/app/pkg/clients/cloudwatch/v1/input.go:74 +0x997
github.com/prometheus-community/yet-another-cloudwatch-exporter/pkg/clients/cloudwatch/v1.client.GetMetricStatistics({0xc0004c7800?, {0x42e37f0?, 0xc00035e068?}}, {0x42c1648, 0xc0007066c0}, 0xc0008063d0, {0xc000b6da40, 0x2, 0x2}, {0xc0004fee80, ...}, ...)
/app/pkg/clients/cloudwatch/v1/client.go:157 +0xf1
github.com/prometheus-community/yet-another-cloudwatch-exporter/pkg/clients/cloudwatch.limitedConcurrencyClient.GetMetricStatistics({{0x42c09b0?, 0xc000010a38?}, {0x42880e0?, 0xc00007a600?}}, {0x42c1648, 0xc0007066c0}, 0xc0008063d0, {0xc000b6da40, 0x2, 0x2}, ...)
/app/pkg/clients/cloudwatch/client.go:78 +0xea
github.com/prometheus-community/yet-another-cloudwatch-exporter/pkg/job.runStaticJob.func1()
/app/pkg/job/static.go:56 +0x39e
created by github.com/prometheus-community/yet-another-cloudwatch-exporter/pkg/job.runStaticJob in goroutine 55
/app/pkg/job/static.go:37 +0x125
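A note on reading the trace: the frames pass through `pkg/job.runStaticJob`, and the first argument to `createGetMetricStatisticsInput` prints as `{0xc000b6da40, 0x2, 0x2}`, which looks like a two-element slice. That suggests the panic may come from one of the `static` jobs rather than from the Redshift discovery job. One unverified guess is that a metric configured only with percentile statistics, such as `[p50, p90, p99]`, leaves a list of standard statistics empty before its first element is read. The Go sketch below only illustrates that class of failure and a guard against it. `splitStatistics`, `firstOrEmpty`, and the regular expression are hypothetical names made up for this example; they are not taken from the YACE source.

```go
package main

import (
	"fmt"
	"regexp"
)

// percentile matches percentile statistics such as "p50" or "p99.9".
// Assumption: this pattern is illustrative only, not the one YACE uses.
var percentile = regexp.MustCompile(`^p\d+(\.\d+)?$`)

// splitStatistics separates standard statistics ("Sum", "Average") from
// percentile ones ("p99"). Hypothetical helper for this sketch.
func splitStatistics(all []string) (standard, extended []string) {
	for _, s := range all {
		if percentile.MatchString(s) {
			extended = append(extended, s)
		} else {
			standard = append(standard, s)
		}
	}
	return standard, extended
}

// firstOrEmpty returns the first element of items, or "" when items is
// empty, instead of panicking the way items[0] would.
func firstOrEmpty(items []string) string {
	if len(items) == 0 {
		return ""
	}
	return items[0]
}

func main() {
	standard, extended := splitStatistics([]string{"p50", "p90", "p99"})
	fmt.Println(len(standard), len(extended)) // prints "0 3"

	// standard[0] would panic here with
	// "index out of range [0] with length 0"; the guard avoids that.
	fmt.Printf("%q\n", firstOrEmpty(standard)) // prints ""
}
```

If this guess is right, temporarily removing the percentile-only metric from the `static` job would be a quick way to confirm it.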
Expected Behavior
The error message about Redshift discovery just before the crash is accurate: there is no Redshift cluster to discover in the AWS account where we were testing the new version of Yace.
However, if that is what causes the panic, Yace should preferably skip the job and move on, as it seemingly did in previous releases.
Steps To Reproduce
No response
Anything else?
YACE is a great tool and we love it!