Changes to support Ensemble Top Level Response Caching #560

Conversation
// This function sets composing model server stats to 0 in case of a cache hit
// when top level response cache is enabled, since composing models are not
// executed and do not have any stats
void ResetServerStats(ServerSideStats* server_stats);
Is this used? If not let's remove
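For context, a minimal sketch of what such a reset could look like, assuming ServerSideStats holds cumulative counters plus per-composing-model stats (all member names below are illustrative, not the actual perf_analyzer definition):

#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-in for the real ServerSideStats; the actual members
// differ. A vector of pairs is used instead of a map so the
// self-referential member type is well-formed.
struct ServerSideStats {
  uint64_t request_count = 0;
  uint64_t queue_time_ns = 0;
  uint64_t compute_infer_time_ns = 0;
  std::vector<std::pair<std::string, ServerSideStats>> composing_models_stat;
};

// Zero out server-side stats after a top-level cache hit, since the
// composing models never executed and carry no meaningful numbers.
// Recurses so that nested ensembles are reset as well.
void ResetServerStats(ServerSideStats* server_stats)
{
  server_stats->request_count = 0;
  server_stats->queue_time_ns = 0;
  server_stats->compute_infer_time_ns = 0;
  for (auto& composing : server_stats->composing_models_stat) {
    ResetServerStats(&composing.second);
  }
}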
LGTM! Thanks for working on this, Harshini!
<< "cache hit/miss " | ||
<< ensemble_times.total_combined_cache_compute_time_avg_us | ||
<< " usec)" << std::endl; | ||
// FIXME - Refactor these calculations in case of ensemble top level |
It's usually good practice to include the ticket number, if you have one, in follow-up comments like:
// FIXME [DLIS-XXXX]: Fix the calculations for the latency breakdown with ensemble+caching
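For context on the calculation the FIXME refers to, here is a rough sketch of how a combined cache/compute average could be derived; the helper and its parameter names are assumptions for illustration, not the actual perf_analyzer code:

#include <cstdint>

// Rough sketch (assumed names): on a cache hit only the cache lookup runs,
// while on a miss the lookup is followed by compute, so hit time, miss time,
// and compute time all contribute to the combined average.
uint64_t CombinedCacheComputeAvgUs(
    uint64_t cache_hit_time_ns, uint64_t cache_miss_time_ns,
    uint64_t compute_time_ns, uint64_t request_count)
{
  if (request_count == 0) {
    return 0;  // avoid dividing by zero when no requests completed
  }
  const uint64_t total_ns =
      cache_hit_time_ns + cache_miss_time_ns + compute_time_ns;
  return (total_ns / request_count) / 1000;  // ns -> usec
}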
// This is because the scheduler sends the cache response and the composing
// models do not get executed. It's a valid scenario and shouldn't throw an error.
bool is_model_version_specified =
// FIXME - Investigate why composing model version is -1 in case of ensemble |
It's usually good practice to include the ticket number, if you have one, in follow-up comments like:
// FIXME [DLIS-XXXX]: ...
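To illustrate the check under discussion: a version string of "-1" is what composing models report when the top-level response came from the cache, so it can be treated like "no version specified" rather than an error. A hypothetical predicate (names are illustrative):

#include <string>

// Hypothetical helper for illustration only: treat "" (unspecified) and
// "-1" (composing model never ran due to a top-level cache hit) as
// "no specific version requested".
bool IsModelVersionSpecified(const std::string& model_version)
{
  return !model_version.empty() && model_version != "-1";
}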
LGTM with the assumption that we'll follow up with the improvements for 24.06; @matthewkotila for the final verdict.
Left a couple of optional nits about the comments.
Commits:
* Fix empty response bug
* Fix unused variable
* Fix test
* Initialize logger to capture logs
* Add unit test
* Change to _ instead of removing
* Check if args.model is not None
* fix artifact path
* Support Python 3.8 in GenAI-Perf (#643)
* Add automation to run unit tests and check code coverage for GenAI-Perf against Python 3.10 (#640)
* Changes to support Ensemble Top Level Response Caching (#560)
* Support for fixed number of requests (#633)
  * first pass. Hardcoded values
  * Working for concurrency (hardcoded whenever count windows is used for now)
  * working for req rate as well
  * Add CLI. Add/fix unit tests
  * Remove hack. Restore all normal functionality
  * Refactor thread config into one class. Add more testing
  * Rename arg to request-count
  * Fix request rate bug
  * Update info print
  * fix corner case
  * move fixme to a story tag
  * add assert to avoid corner case
  * rename variables
  * self review #1
  * copyright changes
  * add doxygen to functions
  * Don't allow sweeping over multiple concurrency or request rate with request-count
* fix test (#637)
* Support custom artifacts directory and improve default artifacts directory (#636)
  * Add artifacts dir option and more descriptive profile export filename
  * Clean up
  * fix input data path
  * Add tests
  * create one to one plot dir for each profile run
  * change the directory look
  * add helper method
* Extend genai perf plots to compare across multiple runs (#635)
  * Modify PlotManager and plots classes
  * Support plots for multiple runs - draft
  * Fix default plot visualization
  * Remove artifact
  * Set default compare directory
  * Support generating parquet files
  * Remove annotations and fix heatmap
  * Fix errors
  * Fix pre-commit
  * Fix CodeQL warning
  * Remove unused comments
  * remove x axis tick label for boxplot
  * Add logging and label for heatmap subplots
  * Allow users to adjust width and height
  * fix grammer
  * Co-authored-by: Hyunjae Woo <[email protected]>
* Generate plot configurations for plot manager (#632)
  * Introduce PlotConfig and PlotConfigParser class
  * Port preprocessing steps and introduce ProfileRunData
  * Create plot configs for default plots
  * fix minor bug
  * Fix comment
  * Implement parse method in PlotConfigParser
  * refactor
  * fix test
  * Add test
  * Address feedback
  * Handle custom endpoint
* Add more metadata to profile export JSON file (#627)
  * Add more metadata to profile export data
  * Fix minor bug
  * refactor
* Add compare subcommand (#623)
  * Move for better visibility
  * Add compare subparser
  * Add subcommand compare
  * Fix test
  * Add ticket
  * add --files option and minor fix
  * Fix tests
  * Add unit tests
  * Address feedback
  * Fix minor error and add section header
* Revert "Changes to support Ensemble Top Level Response Caching (#560) (#642)" (this reverts commit cc6a3b2)
* Changes to support Ensemble Top Level Response Caching (#560) (#642)
Related PR: triton-inference-server/core#338
Changes:
* Updated the DetermineStatsModelVersion() and MergeStatistics() functions to handle the cache-hit scenario where the top-level ensemble request is served from the cache, so the composing models are never executed.
* Added tests for DetermineStatsModelVersion()
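As a rough illustration of the scenario those tests cover (the real DetermineStatsModelVersion() signature differs; the resolver below is a hypothetical stand-in):

#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the version-resolution logic: "" means the user
// did not pin a version, and "-1" is what a composing model reports when the
// top-level ensemble response was served from the cache. Both resolve to the
// latest available version instead of raising an error.
int64_t ResolveStatsVersion(
    const std::string& reported_version, int64_t latest_version)
{
  if (reported_version.empty() || reported_version == "-1") {
    return latest_version;
  }
  return std::stoll(reported_version);
}

int main()
{
  assert(ResolveStatsVersion("-1", 3) == 3);  // cache-hit composing model
  assert(ResolveStatsVersion("2", 3) == 2);   // explicitly pinned version
  return 0;
}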