
feat: 8841: Need a way to change the number of object key buckets in MerkleDb #18910


Open

wants to merge 10 commits into main

Conversation

@artemananiev (Contributor) commented Apr 18, 2025

Fix summary:

  • A new method is introduced to HalfDiskHashMap: resizeIfNeeded()
  • This method is called after every flush to the MerkleDb data source. It checks whether the average bucket entry count in the HDHM exceeds a threshold and doubles the number of buckets if needed
  • A new method is introduced to Bucket: sanitize() (see the sketch after this list)
  • This method checks the bucket's index and removes all stale entries, i.e. entries that were valid for this bucket before the HDHM resize but now belong to a different bucket
  • Changed how MerkleDb path and bucket index sizes are calculated when a new MerkleDb instance is created
  • Many VirtualMap size hints in Hedera services are updated to better fit this new calculation
  • New unit tests are introduced to cover HDHM.resizeIfNeeded()
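
As a rough illustration of the stale-entry check, here is a minimal sketch, assuming power-of-two bucket counts where a bucket's index is the low bits of the key hash; the list-based layout and names are illustrative, not the actual Bucket API:

import java.util.List;

// Illustrative only: the real Bucket stores serialized entries, not a List.
final class BucketSanitizeSketch {
    /**
     * Keeps only the entries whose key hash still maps to this bucket's index.
     * After the bucket count doubles from N to 2N, an entry that used to map
     * to bucket b may now map to bucket b + N and becomes stale in bucket b.
     */
    static void sanitize(final List<Integer> entryHashes, final int bucketIndex, final int numBuckets) {
        // numBuckets is a power of two, so (hash & (numBuckets - 1)) == Math.floorMod(hash, numBuckets)
        entryHashes.removeIf(hash -> (hash & (numBuckets - 1)) != bucketIndex);
    }
}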

Fixes: #8841
Signed-off-by: Artem Ananev [email protected]

@artemananiev added the labels Platform Virtual Map, Platform Data Structures, Platform (Tickets pertaining to the platform), and Scalability (Issues related to the scalability problems) on Apr 18, 2025
@artemananiev self-assigned this on Apr 18, 2025
@jasperpotts (Contributor) commented:

I think I get the idea. I was puzzled how you double the size of the index with concurrent access, but I think I understand: index expansion only happens on startup? And because it is fast, it should not have a big impact on start-up time. I wonder if there is a way to expand while reading from disk without being much slower, rather than needing enough RAM for both the new and old indexes.

It would be really nice to key this off heuristics, so that we record the number of buckets that have overflowed, or the number of buckets that have reached >= 90% full. Then, if more than 20% of buckets have passed that threshold, we double the map size on the next restart. That way, no operations monitoring or config changes are needed to maintain size. This would be great for non-mainnet networks, as they could be configured with minimal sizes and expand as needed, as long as they are restarted. We could add an atomic reference to the index and do compare-and-swap to allow for live index doubling, but I'm not sure if that is too expensive performance-wise.
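
A rough sketch of what such a heuristic could look like; the class, method names, and thresholds below are hypothetical illustrations, not existing HDHM code:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical fullness tracker. Simplified: a bucket flushed twice past the
// threshold is counted twice, which a real implementation would deduplicate.
final class BucketFullnessStats {
    private final AtomicLong nearlyFullBuckets = new AtomicLong();

    /** Called whenever a bucket is flushed to disk. */
    void onBucketFlushed(final int entryCount, final int capacity) {
        if (entryCount * 10L >= capacity * 9L) { // bucket is >= 90% full
            nearlyFullBuckets.incrementAndGet();
        }
    }

    /** True once more than 20% of all buckets have crossed the 90% line. */
    boolean shouldDoubleOnRestart(final long totalBuckets) {
        return nearlyFullBuckets.get() * 5 > totalBuckets;
    }
}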

@jasperpotts (Contributor) commented:

I think we should strive for the goal of automatic management, with minimal operations effort needed. Or what Oracle would call "Autonomous" 😀

@artemananiev (Author) commented:

> It would be really nice to key this off heuristics, so that we record the number of buckets that have overflowed, or the number of buckets that have reached >= 90% full. Then, if more than 20% of buckets have passed that threshold, we double the map size on the next restart. That way, no operations monitoring or config changes are needed to maintain size. This would be great for non-mainnet networks, as they could be configured with minimal sizes and expand as needed, as long as they are restarted. We could add an atomic reference to the index and do compare-and-swap to allow for live index doubling, but I'm not sure if that is too expensive performance-wise.

Automatic index resizing on startup looks doable, but I'd like to track it in a separate ticket. We currently don't collect stats on individual buckets, for example the number of buckets >= 90% full. All we have is the total number of entries and the current number of buckets. With a really bad hash function, it's possible that the map is only 25% full, yet more than 20% of buckets have too many entries.

@artemananiev (Author) commented:

> I think I get the idea. I was puzzled how you double the size of the index with concurrent access, but I think I understand: index expansion only happens on startup? And because it is fast, it should not have a big impact on start-up time. I wonder if there is a way to expand while reading from disk without being much slower, rather than needing enough RAM for both the new and old indexes.

Yes, the idea is to resize at startup only. There will be no point in time when both the old and new indexes are in memory. HDHM should detect that a resize is needed before its index is loaded from a snapshot.

Resizing on the fly is more complicated, but doesn't provide any benefits. When a bucket is loaded from disk (e.g. for an update during flush) and it isn't found, is that because it's an empty bucket, or because its entries are still in the N-1-bit "old" bucket? The same checks need to be done on reads, too. When a bucket is finally updated, its data location in the index is set, and according to our current implementation the whole index up to that point is allocated in memory anyway.
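
To make that ambiguity concrete, a small worked example (plain bit arithmetic, not MerkleDb code): doubling a power-of-two bucket count adds one index bit, so every bucket that exists only after the resize has exactly one "old" counterpart.

final class ResizeAmbiguityExample {
    public static void main(final String[] args) {
        final int oldNumBuckets = 32_768;  // bucket count before doubling
        final int newBucket = 40_000;      // exists only after doubling to 65_536
        // Mask off the new top bit to find where the entries still live:
        final int oldBucket = newBucket & (oldNumBuckets - 1);
        // 40_000 & 32_767 == 7_232, so a read that finds bucket 40_000 missing
        // on disk cannot tell "empty" from "not migrated from bucket 7_232 yet".
        System.out.println("old bucket: " + oldBucket); // prints 7232
    }
}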

codacy-production bot commented May 15, 2025

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation: +0.03% (target: -1.00%)
Diff coverage: 84.02%

Coverage variation details:

                                    Coverable lines    Covered lines    Coverage
Common ancestor commit (d41653d)    100615             73744            73.29%
Head commit (4844b43)               100742 (+127)      73865 (+121)     73.32% (+0.03%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details:

                         Coverable lines    Covered lines    Diff coverage
Pull request (#18910)    169                142              84.02%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified> / <coverable lines added or modified> * 100%

codecov bot commented May 15, 2025

Codecov Report

Attention: Patch coverage is 75.14793% with 42 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...wirlds/merkledb/files/hashmap/HalfDiskHashMap.java 81.08% 11 Missing and 3 partials ⚠️
...ava/com/swirlds/merkledb/files/hashmap/Bucket.java 70.45% 9 Missing and 4 partials ⚠️
...java/com/swirlds/merkledb/MerkleDbTableConfig.java 71.42% 2 Missing and 4 partials ⚠️
...wirlds/merkledb/MerkleDbCompactionCoordinator.java 40.00% 1 Missing and 2 partials ⚠️
.../java/com/swirlds/merkledb/MerkleDbDataSource.java 75.00% 1 Missing and 1 partial ⚠️
...ava/com/swirlds/merkledb/files/DataFileCommon.java 0.00% 2 Missing ⚠️
...rlds/demo/migration/MigrationTestingToolState.java 0.00% 1 Missing ⚠️
...m/swirlds/merkledb/files/hashmap/ParsedBucket.java 90.00% 0 Missing and 1 partial ⚠️


@@             Coverage Diff              @@
##               main   #18910      +/-   ##
============================================
+ Coverage     69.34%   69.38%   +0.03%     
- Complexity    23260    23289      +29     
============================================
  Files          2645     2644       -1     
  Lines        100665   100792     +127     
  Branches      10399    10425      +26     
============================================
+ Hits          69811    69932     +121     
- Misses        26891    26897       +6     
  Partials       3963     3963              
Files with missing lines Coverage Δ Complexity Δ
...hedera/node/app/hints/schemas/V059HintsSchema.java 100.00% <ø> (ø) 4.00 <0.00> (ø)
...hedera/node/app/hints/schemas/V060HintsSchema.java 84.21% <ø> (ø) 8.00 <0.00> (ø)
...e/consensus/impl/schemas/V0490ConsensusSchema.java 100.00% <ø> (ø) 4.00 <0.00> (ø)
...app/service/file/impl/schemas/V0490FileSchema.java 86.54% <ø> (+1.46%) 56.00 <0.00> (+2.00)
...ice/schedule/impl/schemas/V0490ScheduleSchema.java 100.00% <ø> (ø) 7.00 <0.00> (ø)
...ice/schedule/impl/schemas/V0570ScheduleSchema.java 100.00% <ø> (ø) 8.00 <0.00> (ø)
...ice/contract/impl/schemas/V0490ContractSchema.java 100.00% <ø> (ø) 6.00 <0.00> (ø)
...p/service/token/impl/schemas/V0490TokenSchema.java 95.45% <ø> (ø) 4.00 <0.00> (ø)
...p/service/token/impl/schemas/V0530TokenSchema.java 100.00% <ø> (ø) 3.00 <0.00> (ø)
...va/com/swirlds/merkledb/config/MerkleDbConfig.java 92.30% <ø> (ø) 5.00 <0.00> (ø)
... and 11 more

... and 48 files with indirect coverage changes



@artemananiev requested review from akugal, thenswan and anthony-swirldslabs and removed the request for jasperpotts on May 17, 2025
@artemananiev added this to the v0.63 milestone on May 17, 2025
@artemananiev changed the title from "draft: 8841: Need a way to change the number of object key buckets in MerkleDb" to "feat: 8841: Need a way to change the number of object key buckets in MerkleDb" on May 17, 2025
@artemananiev marked this pull request as ready for review on May 17, 2025
@artemananiev requested reviews from a team and tinker-michaelj as code owners on May 17, 2025
 */
public void resizeIfNeeded(final long firstLeafPath, final long lastLeafPath) {
    final long currentSize = lastLeafPath - firstLeafPath + 1;
    if (currentSize / numOfBuckets.get() <= goodAverageBucketEntryCount * 80 / 100) {

@artemananiev (Author) commented:

This formula should be discussed; I'm open to any feedback. Here is how it's currently coded. Assume the VM size hint was originally 1M. With default configs, that results in 32K buckets created. Once the VM actually grows to 80% of the original size hint, that is, to 800K, HDHM will double the number of buckets to 64K. When the VM grows to 1600K, the number of buckets will grow to 128K. And so on.
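
The same numbers as a worked example, assuming goodAverageBucketEntryCount is 32 (the 1M hint divided by the initial 32K buckets); the names mirror the snippet above but the values are illustrative:

final class ResizeThresholdExample {
    public static void main(final String[] args) {
        final long goodAverageBucketEntryCount = 32; // assumed: 1M hint / 32K buckets
        final long numOfBuckets = 32_768;
        // Resize is skipped while the average stays at or below 80% of "good":
        final long threshold = goodAverageBucketEntryCount * 80 / 100; // = 25
        // 25 entries/bucket * 32_768 buckets = 819_200 entries, i.e. roughly
        // the 800K mentioned above; the first flush past that doubles to 64K buckets.
        System.out.println("double buckets once size exceeds " + threshold * numOfBuckets);
    }
}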

@@ -106,7 +106,7 @@ private static final class ClassVersion {
     /**
      * Max number of keys that can be stored in a table.
      */
-    private long maxNumberOfKeys = 0;
+    private long initialCapacity = 0;

@artemananiev (Author) commented:

A long time ago, this was just a hint used to calculate the (unchangeable) number of HDHM buckets. Then the hint was changed to a hard limit. Now it's a hint again, and I'm renaming it to "initialCapacity" (similar to HashMap) to reflect that.

private static final long MAX_TOKENS = 1_000_000L;
private static final long MAX_ACCOUNTS = 1_000_000L;
private static final long MAX_TOKEN_RELS = 1_000_000L;
private static final long MAX_MINTABLE_NFTS = 1_000_000L;

@artemananiev (Author) commented:

When a node starts from an existing state snapshot, these hints are ignored. All mainnet (and testnet, too) nodes have snapshots with 32M key buckets in each of these four large tables, which corresponds to a 1B hint. This will not change.

For new runs (unit tests, benchmarks, HashSphere, etc.), nodes will start with 32K key buckets and grow that number as needed as entries are added to virtual maps.
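
For reference, the 32M-buckets-per-1B-hint correspondence follows from the same sizing rule, assuming ~32 entries per bucket and rounding up to a power of two: 1_000_000_000 / 32 = 31_250_000, and the next power of two is 2^25 = 33_554_432, i.e. 32M buckets. Likewise, a 1M hint yields 1_000_000 / 32 = 31_250, rounded up to 2^15 = 32_768, i.e. the 32K buckets mentioned above.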

@@ -70,8 +70,7 @@
  */
 @ConfigData("merkleDb")
 public record MerkleDbConfig(
-        @Positive @ConfigProperty(defaultValue = "500000000") long maxNumOfKeys,
-        @Positive @ConfigProperty(defaultValue = "" + 4_000_000_000L) long size,
+        @Positive @ConfigProperty(defaultValue = "4000000000") long maxNumOfKeys,

@artemananiev (Author) commented:

maxNumOfKeys previously was not a "max"; it was just a default value used as the virtual map size hint in tests and benchmarks. I have no idea where the old value, 500_000_000, came from.

Now this config is used as a real hard limit on the number of entries that can be stored in a single MerkleDb table.
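
As a sketch of what enforcing such a hard limit could look like (a hypothetical guard, not the actual MerkleDb code):

// Hypothetical capacity guard for a MerkleDb table.
final class CapacityGuardSketch {
    /** Throws if the table has already reached the configured hard limit. */
    static void checkCapacity(final long currentEntries, final long maxNumOfKeys) {
        if (currentEntries >= maxNumOfKeys) {
            throw new IllegalStateException(
                    "MerkleDb table is full: " + currentEntries + " >= max " + maxNumOfKeys);
        }
    }
}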

Labels
Platform Data Structures, Platform Virtual Map, Platform (Tickets pertaining to the platform), Scalability (Issues related to the scalability problems)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Need a way to change the number of object key buckets in MerkleDb
2 participants