Skip to content

Add initial counters for the Dataplane #643

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

sergeymatov
Copy link
Contributor

  • Add Prometheus exporter (with configurable port for metrics)
  • Add batch of counters for Dataplane pipelines

@sergeymatov sergeymatov requested a review from a team as a code owner June 26, 2025 16:34
@sergeymatov sergeymatov requested review from qmonnet and removed request for a team June 26, 2025 16:34
@sergeymatov sergeymatov marked this pull request as draft June 26, 2025 16:35
@Fredi-raspall Fredi-raspall self-requested a review June 27, 2025 11:04
@sergeymatov sergeymatov force-pushed the pr/smatov/counters branch 2 times, most recently from 27b0aec to 5c3ab60 Compare June 27, 2025 13:39
@sergeymatov sergeymatov marked this pull request as ready for review June 27, 2025 17:34
@Fredi-raspall Fredi-raspall added the dont-merge Do not merge this Pull Request label Jun 30, 2025
@sergeymatov sergeymatov force-pushed the pr/smatov/counters branch from 5c3ab60 to feefefd Compare July 1, 2025 11:43
Default access is 9090/metrics.

Signed-off-by: Sergey Matov <[email protected]>
@sergeymatov sergeymatov force-pushed the pr/smatov/counters branch from feefefd to 7f87a74 Compare July 1, 2025 11:44
Copy link
Collaborator

@daniel-noland daniel-noland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only major concern is all the naked strings. I think refactor will be hard and this is avoidable if we fix it now

})?;

// Describe core gateway metrics
describe_counter!(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally we wouldn't have so many hard coded string running around.

I recommend adding a traits along the lines of

pub trait GatewayCounter {
    const NAME: &'static str;
    const DESCRIPTION: &'static str;
}

then implementing that trait for a series of zero sized types.

The problem with the naked strings is that it is tricky to refactor or get an exhaustive list of exactly where these counters are used or updated. The type system can help a lot here.

@@ -35,6 +39,92 @@ fn init_logging() {
.init();
}

fn init_metrics(metrics_port: u16) -> Result<(), Box<dyn std::error::Error>> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method should get annotated with

#[tracing::instrument(level = "info")]

@@ -65,13 +67,17 @@ impl Egress {
let Some(our_mac) = interface.get_mac() else {
error!("{nfi}: Failed to get mac address of interface {ifname}!");
packet.done(DoneReason::InternalFailure);
counter!("gateway_errors_total", "component" => "egress", "error" => "missing_src_mac")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this strategy is ok for the moment, but in a future implementation we should treat the returned Counter objects as state and not regenerate them on every function call

}

fn interface_ingress<Buf: PacketBufferMut>(
&self,
interface: &Interface,
packet: &mut Packet<Buf>,
) {
let ifname = &interface.name;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you clone interface.name here then you can avoid allocating memory in the follow on invokes of counter!

return;
};
debug!("{nfi}: processing packet to {dst} with vrf {vrfid}");

/* decrement packet TTL -- packet may be done if TTL is exceeded */
Self::decrement_ttl(packet, dst);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the logic, but

  1. I think this change is out of scope here
  2. we don't check that packet is_done because the hop limit is past. Just that the packet is done, so the log can be incorrect. Shouldn't decrement_ttl return an enum for this event?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dont-merge Do not merge this Pull Request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants