Skip to content

fix bug in average ring buffer and add negative duration protection to scheduler decision maker #5396

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 18, 2023

Conversation

bdoyle0182
Copy link
Contributor

Description

The average ring buffer has an edge case where discarding the max and min outliers from the average calculation can result in the average being negative due to the max and min not ever being reset even when the max and min value is ultimately discarded from the ongoing buffer. So the way it's set up it keeps the absolute max and absolute min duration for this function that has ever occurred. I just got rid of the max and min outlier discarding for now because I don't think it's adding much value and to appropriately update the max and min would require a performance hit to the ring buffer. I've also added negative duration buffer protection to the scheduling decision maker so that if such a case were introduced again in the future, the scheduling decision maker can more gracefully handle it and still fallback to scheduling containers when there are stale activations rather than thinking the amount of containers to schedule is negative.

Here is a raw example of how things can go wrong.

There is an edge case on the (sum - max - min) / (size - 2) calculation. The max never gets reset even if the element leaves the buffer. So say you have a random activation that takes 60s to run. The normal case is that all other activations take 50ms. If the buffer size is 10, then the sum should = 500 assuming the 60000 activation has left the buffer. however the max value will still be set to 60000.

(500 - 60000 - 50) / (10 - 2) = -7443ms

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@codecov-commenter
Copy link

codecov-commenter commented Apr 14, 2023

Codecov Report

Merging #5396 (8f3f04d) into master (4e2dea1) will decrease coverage by 0.99%.
The diff coverage is 62.50%.

❗ Current head 8f3f04d differs from pull request most recent head f05bf00. Consider uploading reports for the commit f05bf00 to get more accurate results

@@            Coverage Diff             @@
##           master    #5396      +/-   ##
==========================================
- Coverage   76.77%   75.78%   -0.99%     
==========================================
  Files         240      240              
  Lines       14600    14596       -4     
  Branches      634      672      +38     
==========================================
- Hits        11209    11062     -147     
- Misses       3391     3534     +143     
Impacted Files Coverage Δ
...rg/apache/openwhisk/common/AverageRingBuffer.scala 28.57% <0.00%> (+1.29%) ⬆️
...core/scheduler/queue/SchedulingDecisionMaker.scala 97.02% <71.42%> (-1.94%) ⬇️

... and 31 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bdoyle0182 bdoyle0182 requested a review from ningyougang April 17, 2023 16:58
@@ -32,18 +32,12 @@ object AverageRingBuffer {
class AverageRingBuffer(private val maxSize: Int) {
private var elements = Vector.empty[Double]
private var sum = 0.0
private var max = 0.0
private var min = 0.0

def nonEmpty: Boolean = elements.nonEmpty

def average: Double = {
val size = elements.size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance size is 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only place this function is called, it does a nonEmpty check first or skips over it. Now debatably that check should also be happening inside this function for sure to prevent the divide by zero; but since it's not an existing issue for the moment I'm going to merge this in to get this fixed.

@ningyougang
Copy link
Contributor

Good catch 👍

@bdoyle0182 bdoyle0182 merged commit de3e0a8 into apache:master Apr 18, 2023
@style95
Copy link
Member

style95 commented May 2, 2023

@bdoyle0182 Thank you for handling this.
It makes sense to me. 👍

@style95
Copy link
Member

style95 commented May 21, 2023

I realized this commit fixed my annoying issue for a long time as well.
@bdoyle0182, thanks again :)

mtt-merz pushed a commit to mtt-merz/openwhisk that referenced this pull request Oct 22, 2023
…ing decision maker (apache#5396)

Co-authored-by: Brendan Doyle <[email protected]>
(cherry picked from commit de3e0a8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants