
The kinesis autoscaler can't scale up by less than double #101

Open
moskyb opened this issue Jan 6, 2021 · 15 comments

@moskyb (Contributor) commented Jan 6, 2021

So a little while ago, after running into issues using a scale-up percentage of less than 100%, I submitted this PR. My understanding was that if I had an (abridged) config like:

"scaleUp": {
  "scaleThresholdPct": 75,
  "scaleAfterMins": 1,
  "scalePct": 115 
}

and a stream that currently had 100 shards, the kinesis autoscaler would say "100 shards * 1.15, okay, the stream will have 115 shards when I scale up".

As far as I can tell from looking at the code, though, that's not actually the case: this line of code indicates that the autoscaler interprets scalePct: 115 as "add 115% of the stream's current capacity to its existing capacity". This means that scalePct: 115 on a stream with 100 shards will actually scale the stream up to 215 shards.

The issue here isn't the behaviour itself - that's totally fine. However, the config parser will throw an error if scalePct is less than 100, meaning that any scale-up operation must at least double the capacity of the stream.

I'm happy to go in and modify this in whatever way is necessary - either change it so that we can use a scalePct < 100, or change the scaling behaviour - but I'm not sure what the actual expected behaviour is. I'm hoping the maintainers can provide some clarity on this :)
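
To make the two readings concrete, here's a quick sketch of the arithmetic (my own illustration, not the project's actual code):

int currentShards = 100;
int scalePct = 115;

// Reading 1 (what I expected): scalePct is the new capacity, expressed as a
// percentage of the current capacity.
int expectedNewShards = currentShards * scalePct / 100;                 // -> 115

// Reading 2 (what the code appears to do): scalePct% of the current capacity
// is added on top of the existing capacity.
int actualNewShards = currentShards + currentShards * scalePct / 100;   // -> 215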

@IanMeyers (Contributor)

Yes, it appears that this logic has diverged. Very unfortunate - in this case I believe the config parser should be modified, as folks will have configurations that rely on the setting of this value. Happy to take a PR for this, or I can fix it sometime next week.

IanMeyers self-assigned this Jan 7, 2021
@IanMeyers (Contributor)

Nevermind - fixing it now

IanMeyers added a commit that referenced this issue Jan 7, 2021
@IanMeyers (Contributor) commented Jan 7, 2021

This should be fixed in 81b6fb8, version .9.8.3

@moskyb (Contributor, Author) commented May 5, 2021

@IanMeyers what's the status of 9.8.3? Is it coming any time soon?

@IanMeyers (Contributor)

@rebecca2000

@IanMeyers @moskyb I noticed that the autoscaler still fails to scale up by less than double. Excerpt from config:

"scaleUp": {
    "scaleThresholdPct": 80,
    "scaleAfterMins": 1,
    "scalePct": 20,
    "coolOffMins": 5,
},

I expect this to add 20% more capacity to the stream.
Observed behaviour: the autoscaler detects that it needs to scale up but fails to do so (current shard count = 1):

Jul 16 05:09:38 ip-172-31-32-32 server: 05:09:38.826 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Utilisation of PUT Records 105.62% at 16/07/2021, 05:08 upon current value of 1056.23 and Stream max of 1000.00
Jul 16 05:09:38 ip-172-31-32-32 server: 05:09:38.826 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - PUT Records performance analysis: 1 high samples, and 0 low samples
Jul 16 05:09:38 ip-172-31-32-32 server: 05:09:38.826 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Will decide scaling action based on metric PUT[Records] due to highest utilisation metric value 105.62%
Jul 16 05:09:38 ip-172-31-32-32 server: 05:09:38.826 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Scaling Votes - GET: DOWN, PUT: UP
Jul 16 05:09:38 ip-172-31-32-32 server: 05:09:38.875 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Next Check Cycle in 60 seconds
Jul 16 05:10:00 ip-172-31-32-32 dhclient[2735]: XMT: Solicit on eth0, interval 116910ms.
Jul 16 05:10:01 ip-172-31-32-32 systemd: Started Session 5 of user root.
Jul 16 05:10:01 ip-172-31-32-32 systemd: Started Session 6 of user root.
Jul 16 05:10:38 ip-172-31-32-32 server: 05:10:38.876 [pool-2-thread-1] INFO  c.a.s.k.s.auto.StreamMetricManager - Requesting 1 minutes of CloudWatch Data for Stream Metric GetRecords.Bytes

If I understand correctly, this line of code should add the scale-up percentage of the current shard count to the current shard count.
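
For reference, the arithmetic I'd expect here is roughly the following (my own sketch - the exact rounding is an assumption on my part):

int currentShards = 1;
int scalePct = 20;
int delta = (currentShards * scalePct + 99) / 100; // 20% of 1 shard, rounded up to a whole shard -> 1
int newShards = currentShards + delta;             // -> 2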

Could someone look into this? Thanks!

@IanMeyers (Contributor)

Yep this looks like it should definitely be scaling up to 2 shards based upon a 105.62% put records threshold. Can you please confirm that you are running version .9.8.4?

@rebecca2000

Yep, I am running that version, taken from the link in the README:

[screenshot]

@IanMeyers (Contributor)

OK - if you could please deploy the .9.8.5 version that's been uploaded to the /dist folder and then turn on DEBUG logging (Beanstalk application parameter LOG_LEVEL=DEBUG), we should be able to get more details about why it's deciding not to scale.

@rebecca2000 commented Jul 22, 2021

Thanks - the reason given is "Not requesting a scaling action because new shard count equals current shard count, or new shard count is 0":

Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.320 [pool-2-thread-1] INFO  c.a.s.k.s.auto.StreamMetricManager - Requesting 1 minutes of CloudWatch Data for Stream Metric PutRecords.Records
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.368 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - GET Bytes performance analysis: 0 high samples, and 2 low samples
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.369 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - GET Records performance analysis: 0 high samples, and 2 low samples
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.369 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Will decide scaling action based on metric GET[Bytes] due to highest utilisation metric value 0.00%
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.369 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Utilisation of PUT Bytes 3.91% at 22/07/2021, 06:03 upon current value of 40986.02 and Stream max of 1048576.00
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.369 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - PUT Bytes performance analysis: 0 high samples, and 1 low samples
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.371 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Utilisation of PUT Records 98.60% at 22/07/2021, 06:03 upon current value of 985.95 and Stream max of 1000.00
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.371 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - PUT Records performance analysis: 1 high samples, and 0 low samples
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.371 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Will decide scaling action based on metric PUT[Records] due to highest utilisation metric value 98.60%
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.371 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Scaling Votes - GET: DOWN, PUT: UP
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.421 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Not requesting a scaling action because new shard count equals current shard count, or new shard count is 0
Jul 22 06:04:47 ip-172-31-46-122 server: 06:04:47.421 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Next Check Cycle in 60 seconds
Jul 22 06:05:47 ip-172-31-46-122 server: 06:05:47.421 [pool-2-thread-1] INFO  c.a.s.k.s.auto.StreamMetricManager - Requesting 1 minutes of CloudWatch Data for Stream Metric GetRecords.Bytes

@IanMeyers (Contributor)

Great - can you please turn on DEBUG-level logging, so we can see exactly what the calculation was?

@rebecca2000

All good, here are the additional logs:

Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.490 [pool-2-thread-1] INFO  c.a.s.k.s.auto.StreamMetricManager - Requesting 1 minutes of CloudWatch Data for Stream Metric GetRecords.Records
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.527 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Utilisation of PUT Bytes 4.18% at 23/07/2021, 05:38 upon current value of 43878.37 and Stream max of 1048576.00
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.529 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - PUT Bytes performance analysis: 0 high samples, and 1 low samples
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.530 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Utilisation of PUT Records 105.59% at 23/07/2021, 05:38 upon current value of 1055.93 and Stream max of 1000.00
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.531 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - PUT Records performance analysis: 1 high samples, and 0 low samples
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.532 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Will decide scaling action based on metric PUT[Records] due to highest utilisation metric value 105.59%
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.532 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - GET Bytes performance analysis: 0 high samples, and 2 low samples
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.532 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - GET Records performance analysis: 0 high samples, and 2 low samples
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.532 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Will decide scaling action based on metric GET[Bytes] due to highest utilisation metric value 0.00%
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.532 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Scaling Votes - GET: DOWN, PUT: UP
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.532 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Determined Scaling Direction UP
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.577 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Current Shard Count: 1
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.578 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Calculated new Target Shard Count of 1
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.578 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Not requesting a scaling action because new shard count equals current shard count, or new shard count is 0
Jul 23 05:39:56 ip-172-31-40-40 server: 05:39:56.579 [pool-2-thread-1] INFO  c.a.s.k.scaling.auto.StreamMonitor - Next Check Cycle in 60 seconds

@IanMeyers (Contributor)

Hello,

So I missed it in your config the first time. Up to and including version .9.8.6, a scalePct of less than 100 often doesn't result in any action being taken on streams with a very low number of shards - as we've observed here. However, if you install version .9.8.7, you'll find I've now extended this logic, which has tripped up customers for ages. Now any scalePct will result in a scaling action being taken, even a request to scale up by 20% on 1 shard, which may mean you end up over-provisioned. Also, the way that scaleDown configurations were expressed was really confusing for the same reasons. There is new documentation on this in the README.md, and you can find a set of examples in a unit test for the scaling calculation if you are interested. Please let me know if this meets your expectations.
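
Roughly speaking, the new scale-up calculation behaves like the sketch below (an illustration only, not the literal implementation - the unit tests are the authoritative examples):

// Illustrative sketch only: a scale-up now always changes the shard count,
// even for small scalePct values on small streams.
static int scaleUpTarget(int currentShards, int scalePct) {
    int delta = (currentShards * scalePct + 99) / 100; // integer ceiling of scalePct% of current capacity
    return currentShards + Math.max(1, delta);         // always add at least one shard
}

// e.g. scaleUpTarget(1, 20) -> 2, scaleUpTarget(100, 20) -> 120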

Thx,

Ian

@rebecca2000

Thanks Ian, the unit tests are really helpful and the documentation is clear :) One small thing I noticed is that a scale up action will always add at least one shard, while scaling down might not change the shard count (apart from min shardCount = 1 of course). So for this case -

@Test
public void testScaleDownBoundary() {
    assertEquals(9, StreamScalingUtils.getNewShardCount(10, null, 10, ScaleDirection.DOWN));
    assertEquals(9, StreamScalingUtils.getNewShardCount(9, null, 10, ScaleDirection.DOWN));
}

Our stream will never scale below 9 shards, which might not be desirable if the min shard count is, say, 5 and the shard count naturally sits in that range.
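
If I've read the new logic right, the scale-down side rounds the removed capacity down, which is what produces this boundary. A sketch of my understanding (not the actual StreamScalingUtils code):

static int scaleDownTarget(int currentShards, int scalePct) {
    int delta = currentShards * scalePct / 100;    // integer floor of scalePct% of current capacity
    return Math.max(1, currentShards - delta);     // never drop below 1 shard
}

// scaleDownTarget(10, 10) -> 9, but scaleDownTarget(9, 10) -> 9,
// so a 9-shard stream never gets below 9 shards with scalePct = 10.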

Anyway, just my 2 cents. Thanks for clarifying the scaling behaviour!

@IanMeyers (Contributor)

Hey there - yes, that was intentional. I'd rather we not scale down and leave the stream with ample capacity than scale down too aggressively and end up with throttling. This could be added as a switch to the overall architecture, but I think it's better to be conservative on scaling down - as you find elsewhere with cool-offs in EC2 etc.
