
chore(storage): Adjust the tier_compaction limit to reduce the impact on ci #9534


Merged — 9 commits merged into main from li0k/storage_revert_compaction_config on May 2, 2023

Conversation

@Li0k (Contributor) commented Apr 28, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

e2e tests found that a large number of compaction tasks affected the batch query latency.

It is suspected that the frequent compaction tasks cause more cache misses, which in turn increase the latency.

I have observed that cg3 accumulates hundreds of sub_levels over the whole batch e2e test, which will hurt our read performance. Based on this information, I found a factor:

  1. Before this PR, the triggering of trivial_move was restricted: when the level_size was small, trivial_move was deferred until pick_whole_level. After the introduction of the new picker logic, this behavior turns out to be wrong for the batch e2e test. The batch e2e test data is not overlapping, so trivial_move cannot be triggered, which in turn means tier_compaction's pick_multi_level is not triggered either (too few selectable levels in the key_range). In addition, the batch e2e test data is too small to trigger base compaction, so all the data piles up in the sub_levels.

Changes in this PR

  1. Increase the value of level_0_overlapping_sub_level_count from 3 to 6
  2. Remove the size limit on trivial_move
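
To make the two knobs concrete, here is a minimal sketch of what the change amounts to. The constant `level_0_overlapping_sub_level_count` comes from this PR's description; the struct, the default value layout, and the `can_trivial_move` helper are hypothetical names used only for illustration, not the actual RisingWave API.

```rust
/// Hypothetical sketch of the two config-level changes in this PR.
/// Names are illustrative; only `level_0_overlapping_sub_level_count`
/// is taken from the PR description.
struct CompactionConfig {
    /// Minimum number of overlapping sub-levels in L0 before a
    /// tier_compaction task is picked.
    level_0_overlapping_sub_level_count: u64,
}

impl Default for CompactionConfig {
    fn default() -> Self {
        Self {
            // Before this PR: 3. After this PR: 6, so that small,
            // frequent L0 flushes trigger fewer compaction tasks.
            level_0_overlapping_sub_level_count: 6,
        }
    }
}

/// Whether an SST can be moved down a level without being rewritten.
/// Before this PR the move was also gated on a minimum file size; after
/// this PR that size check is dropped, so the tiny, non-overlapping SSTs
/// produced by the batch e2e test can still be trivially moved instead
/// of piling up as L0 sub-levels.
fn can_trivial_move(overlaps_lower_level: bool) -> bool {
    !overlaps_lower_level
}
```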

Checklist For Contributors

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer to the issue)
  • All checks passed in ./risedev check (or alias, ./risedev c)

Checklist For Reviewers

  • I have requested macro/micro-benchmarks as this PR can affect performance substantially, and the results are shown.

Documentation

  • My PR DOES NOT contain user-facing changes.

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

@Li0k changed the title from "Li0k/storage revert compaction config" to "chore(storage): revert compaction config" on Apr 28, 2023
@soundOfDestiny (Contributor) left a comment


revert PR is welcome

@lmatz (Contributor) commented Apr 28, 2023

> e2e tests found that a large number of compaction tasks affected the batch query latency.

If we revert it, does it affect the performance of streaming jobs?

If it may lead to a large performance impact, I suggest we build two images, one with and one without this PR, both based on the latest main, to evaluate the regression.

@codecov (bot) commented May 2, 2023

Codecov Report

Merging #9534 (64eb7a9) into main (1f04347) will increase coverage by 0.00%.
The diff coverage is 91.83%.

@@           Coverage Diff           @@
##             main    #9534   +/-   ##
=======================================
  Coverage   70.73%   70.73%           
=======================================
  Files        1233     1233           
  Lines      206688   206715   +27     
=======================================
+ Hits       146199   146221   +22     
- Misses      60489    60494    +5     
| Flag | Coverage Δ |
| --- | --- |
| rust | 70.73% <91.83%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| src/meta/src/hummock/compaction_scheduler.rs | 44.77% <ø> (ø) |
| ...ta/src/hummock/manager/compaction_group_manager.rs | 83.24% <0.00%> (-0.33%) ⬇️ |
| ...ummock/compaction/picker/tier_compaction_picker.rs | 85.53% <97.43%> (+0.48%) ⬆️ |
| ...c/meta/src/hummock/compaction/compaction_config.rs | 88.05% <100.00%> (+0.36%) ⬆️ |
| .../compaction/picker/base_level_compaction_picker.rs | 84.00% <100.00%> (+0.02%) ⬆️ |
| src/meta/src/hummock/compaction_schedule_policy.rs | 93.51% <100.00%> (+0.01%) ⬆️ |
| src/meta/src/hummock/test_utils.rs | 95.03% <100.00%> (+0.01%) ⬆️ |

... and 3 files with indirect coverage changes


@Li0k changed the title from "chore(storage): revert compaction config" to "chore(storage): Adjust the tier_compaction limit to reduce the impact on ci" on May 2, 2023
@Li0k (Contributor, Author) commented May 2, 2023

> > e2e tests found that a large number of compaction tasks affected the batch query latency.
>
> If we revert it, does it affect the performance of streaming jobs?
>
> If it may lead to a large performance impact, I suggest we build two images, one with and one without this PR, both based on the latest main, to evaluate the regression.

It might, so I tried to make as few changes as possible; we need to pay more attention. The failure rate of e2e on main is so high that I prefer merging this first to solve the problem.

@Li0k added this pull request to the merge queue on May 2, 2023
Merged via the queue into main with commit 3cff0eb on May 2, 2023
@Li0k deleted the li0k/storage_revert_compaction_config branch on May 2, 2023 at 11:18
@Li0k (Contributor, Author) commented May 2, 2023

e2e test

  • before: [screenshots of e2e test duration and failure rate before this PR]

The e2e test shows that after #8996 the e2e time consumption increases, and the timeouts cause the failure rate to increase.

Reason

I tried to reproduce the problem locally and observe the system's behavior via Grafana.
The number of l0_sub_levels is not as expected: according to Grafana, there is only one file per sub_level. The logs show that there is no overlap between them and no trivial_move is triggered.

  • [Grafana screenshots of l0_sub_level counts, omitted]

In the current logic, the size of the sst file eligible for trivial_move is limited. In the e2e test, however, the ssts contain very little data and hit this limit, so trivial_move cannot be triggered. Without trivial_move, and because the e2e test data does not overlap, the new picker cannot select enough levels, so it cannot trigger tier_compaction. Due to the small amount of data, base-level compaction cannot be triggered either.
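
The chain of conditions above can be summarized in a small model. All names and thresholds below are assumptions made for illustration; this is not the real picker code, only a sketch of why no task was produced before this PR.

```rust
/// Illustrative model of why the batch e2e workload produced no
/// compaction task before this PR. Thresholds and names are assumed,
/// not copied from the actual picker implementation.
struct SubLevel {
    total_file_size: u64,
    overlapping: bool, // whether its key range overlaps other sub-levels
}

fn pick_task(sub_levels: &[SubLevel], l0_total_size: u64) -> Option<&'static str> {
    // Assumed thresholds, for illustration only.
    const TRIVIAL_MOVE_MIN_SIZE: u64 = 32 << 20; // old size limit on trivial_move
    const MIN_PICKABLE_SUB_LEVELS: usize = 3;
    const BASE_COMPACTION_MIN_SIZE: u64 = 128 << 20;

    // 1. Every e2e SST is tiny, so the (old) size limit rejects trivial_move.
    if sub_levels
        .iter()
        .any(|l| !l.overlapping && l.total_file_size >= TRIVIAL_MOVE_MIN_SIZE)
    {
        return Some("trivial_move");
    }
    // 2. The e2e data does not overlap, so too few sub-levels qualify and
    //    no tier_compaction task is emitted.
    if sub_levels.iter().filter(|l| l.overlapping).count() >= MIN_PICKABLE_SUB_LEVELS {
        return Some("tier_compaction");
    }
    // 3. Total L0 size is far below the base-compaction threshold.
    if l0_total_size >= BASE_COMPACTION_MIN_SIZE {
        return Some("base_compaction");
    }
    // Net effect: nothing is picked and hundreds of sub-levels accumulate.
    None
}
```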


TLDR

  1. The selected file is too small to trigger trivial_move
  2. The data in the batch e2e test does not overlap, so the non-overlapping-level picker cannot pick enough levels
  3. The amount of data is too small to trigger base-level compaction

Result

  • before: [screenshots of e2e results before the change]

  • after: [screenshots of e2e results after the change]
