-
Notifications
You must be signed in to change notification settings - Fork 633
feat(ctl): add API to detect bottleneck actors #21149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c3bffc5
to
c94b84a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be better to decouple the API for fetching the await-tree json, and the API to detect bottleneck actors.
That's because our bottleneck-detecting strategy may not be perfect. So we may wish to fetch the await-tree json from the user cluster first.
Then we can iterate on the bottleneck detecting algorithm, and tweak the tool to suit the await-tree.
Besides that, great work!
c94b84a
to
e9c79a7
Compare
e9c79a7
to
648ac9f
Compare
src/meta/src/dashboard/mod.rs
Outdated
let result = client | ||
.stack_trace(StackTraceRequest::default()) | ||
.await | ||
.map_err(err)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may also allow specifying format here in the meta dashboard.
And can do analyze there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be better to decouple the API for fetching the await-tree json, and the API to detect bottleneck actors.
That's because our bottleneck-detecting strategy may not be perfect. So we may wish to fetch the await-tree json from the user cluster first.
@kwannoel read my mind. I choose to write this in risectl is for decoupling dumping and analyzing. But of course we can allow one to specify format through meta's API.
Actually I prefer a js/web based tool to analyze the bottleneck. 🤔 That would be faster to iterate and demonstrate. But it's also fine to have one in risectl. |
Tested with this SQL (release build):
Works well for me:
|
+1 for this, it takes a long time for users to upgrade. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work! Mainly just nit picking.
What about we first get the changes to get JSON format merged (maybe can also cherry-picked). And refine the analyzer part later?
if req.actor_traces_format.is_some() | ||
&& req.actor_traces_format.as_ref().unwrap() == "text" | ||
{ | ||
v.to_string() | ||
} else { | ||
serde_json::to_string(&v).unwrap() | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is different from the comment: "// Allowed values: "json", "text". Default is "text"."
Here it defaults to json
.
Maybe better to define an enum instead of using string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
use crate::cmd_impl::await_tree::tree::TreeView; | ||
use crate::cmd_impl::await_tree::utils::extract_actor_traces; | ||
|
||
pub fn transcribe(path: String) -> anyhow::Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking that this functionality can be provided by the library itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with that.
/// the bottleneck actor is still yielding output to downstream actors. A typical | ||
/// case is JOIN amplification. So the corresponding actors are actively processing | ||
/// the data but the EPOCH span is blocked. | ||
pub fn is_bottleneck(&self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, the algorithm looks much easier than I thought initially. So there's no need to traverse between actors (based on the Input
and Output
spans) at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I'm not entirely sure this algorithm works in every situation. However, in most cases where there's only one bottleneck actor, it should perform as expected. I can think of a few edge cases:
- If the bottleneck's span doesn't call
instrument_await
, the bottleneck's tree might not exhibit the typical "slow parent with fast children" signal. - If an MV or the entire graph has multiple fragments acting as bottlenecks, I'm uncertain what the outcome would be.
At present, this tool doesn't completely eliminate manual effort. It definitely requires ongoing optimization and the addition of new rules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. I'm uncertain what the outcome would be.
Shall we output all of them, ordering by some sort of confidence or probablity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tend to leave it for future. At present, we don't have too much experience dealing with these edge cases. We can add some new rules when we can summarize some tips for them. For now, if the current bottleneck_detect
return nothing, the on-caller should just use transcribe
to convert the json await tree to text and mannually analyse the text await tree.
src/ctl/Cargo.toml
Outdated
@@ -31,6 +31,7 @@ risingwave_pb = { workspace = true } | |||
risingwave_rpc_client = { workspace = true } | |||
risingwave_storage = { workspace = true } | |||
risingwave_stream = { workspace = true } | |||
rw-diagnose-tools = { git = "https://github.com/risingwavelabs/rw-diagnose-tools" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use github url for now. Need @BugenZhao 's help to publish the repo in crates.io.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Others LGTM. Thanks for the work
if req.actor_traces_format.is_some() | ||
&& req.actor_traces_format.as_ref().unwrap() == "text" | ||
{ | ||
v.to_string() | ||
} else { | ||
serde_json::to_string(&v).unwrap() | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not resolved
src/meta/src/dashboard/mod.rs
Outdated
let result = client.stack_trace().await.map_err(err)?; | ||
let result = client | ||
.stack_trace(StackTraceRequest { | ||
actor_traces_format: Some("text".to_owned()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a parameter to the meta dashboard /await_tree
call instead. Can default to text
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added. Example of a request:
curl "http://localhost:5691/api/monitor/await_tree/?format=text" > diagnose.txt
src/meta/src/manager/diagnose.rs
Outdated
&& let Ok(result) = client.stack_trace().await | ||
&& let Ok(result) = client | ||
.stack_trace(StackTraceRequest { | ||
actor_traces_format: Some("json".to_owned()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, please add a parameter to the /diagnose
API instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example:
curl http://localhost:5691/api/monitor/diagnose/\?actor_traces_format\=json > diagnose.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After address @xxchan comments, rest LGTM. Thanks for the work!
ca3a490
to
cb52394
Compare
cb52394
to
7b581ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, Thanks! Should we cherry-pick to earlier releases?
Warning The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
Highest Release Version: 2.3
What's changed and what's your intention?
Add an API in risectl to using await tree to locate the bottleneck in streaming graphs. Currently, the API will first pull for a new await tree dump and analyze on it. Ideally, it should also be able to read from an existing diagnose file or await tree file. Will add this later.
Checklist
Documentation
Release note