implement gc data copy #303

Open · JacksonYao287 wants to merge 4 commits into main from gc-data-copy

Conversation

@JacksonYao287 (Collaborator) commented May 27, 2025

Crash recovery is also included in this PR.

@codecov-commenter commented May 27, 2025

Codecov Report

Attention: Patch coverage is 66.00567% with 120 lines in your changes missing coverage. Please review.

Project coverage is 62.96%. Comparing base (1746bcc) to head (3bed784).
Report is 84 commits behind head on main.

Files with missing lines Patch % Lines
src/lib/homestore_backend/gc_manager.cpp 63.31% 64 Missing and 20 partials ⚠️
src/lib/homestore_backend/hs_pg_manager.cpp 72.91% 10 Missing and 3 partials ⚠️
src/lib/homestore_backend/hs_shard_manager.cpp 72.41% 6 Missing and 2 partials ⚠️
src/lib/homestore_backend/heap_chunk_selector.cpp 65.00% 7 Missing ⚠️
src/lib/homestore_backend/hs_homeobject.cpp 46.15% 6 Missing and 1 partial ⚠️
src/lib/homestore_backend/hs_blob_manager.cpp 0.00% 1 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main     #303      +/-   ##
==========================================
- Coverage   63.15%   62.96%   -0.19%     
==========================================
  Files          32       35       +3     
  Lines        1900     3786    +1886     
  Branches      204      453     +249     
==========================================
+ Hits         1200     2384    +1184     
- Misses        600     1150     +550     
- Partials      100      252     +152     


@JacksonYao287 force-pushed the gc-data-copy branch 3 times, most recently from 98b208d to c3cc258 on May 28, 2025 13:12
@@ -212,7 +222,7 @@ GCManager::pdev_gc_actor::pdev_gc_actor(uint32_t pdev_id, std::shared_ptr< HeapC
std::shared_ptr< GCBlobIndexTable > index_table, HSHomeObject* homeobject) :
m_pdev_id{pdev_id},
m_chunk_selector{chunk_selector},
-    m_reserved_chunk_queue{RESERVED_CHUNK_NUM_PER_PDEV},
+    m_reserved_chunk_queue{HS_BACKEND_DYNAMIC_CONFIG(reserved_chunk_num_per_pdev)},


Is it hot-swappable?

Collaborator Author:

No, it is not supported yet; the value is finalized the first time we bootstrap.

Collaborator:

Then maybe let's put it into the metablks? What would happen if this value changed after bootstrap?

Collaborator Author:

Nothing will happen. At the moment, all these parameters take effect only at the initial step and never change after that. We can make it hot-swappable when we really need it.

@JacksonYao287 force-pushed the gc-data-copy branch 5 times, most recently from 5a8cdd5 to 16c6a02 on June 2, 2025 05:33
data_sgs.iovs.emplace_back(
iovec{.iov_base = iomanager.iobuf_alloc(blk_size, total_size), .iov_len = total_size});

futs.emplace_back(std::move(
Collaborator:

I'm really not sure about the disk behavior of reading a blob and then writing it elsewhere... will that generate lots of random IO? I hope we can verify the IO behavior and performance in further testing.

Collaborator Author:

Yes, this will definitely generate some random IO, since the blobs are probably in different HDD sectors. We can limit the impact with a rate limiter that constrains the traffic from gc.

Collaborator:

If that is the case, it seems to make more sense to fetch a batch of blobs and then write a batch of blobs. Rate-limiting doesn't help an inefficient IO become efficient...

Collaborator Author:

My point here is not to make inefficient IO efficient.

I mean that through the rate limiter we can constrain the amount of inefficient IO, so that it takes only a small part of the total IO. Let me think about it a bit more; we can cycle back on this later.
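
The rate-limiter idea discussed here could be sketched as a simple token bucket that caps the bytes per second GC may issue. The class name and API below are hypothetical illustrations, not part of the actual GCManager code.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical sketch: a token bucket that caps the bytes/sec GC may issue,
// so the random IO from blob copies stays a small fraction of disk traffic.
class GcRateLimiter {
public:
    using Clock = std::chrono::steady_clock;

    GcRateLimiter(uint64_t bytes_per_sec, uint64_t burst_bytes) :
            rate_{bytes_per_sec}, capacity_{burst_bytes}, tokens_{burst_bytes}, last_{Clock::now()} {}

    // returns true if `nbytes` of GC IO may be issued now; otherwise the
    // caller should defer the copy until tokens refill
    bool try_acquire(uint64_t nbytes) {
        refill();
        if (tokens_ < nbytes) { return false; }
        tokens_ -= nbytes;
        return true;
    }

private:
    void refill() {
        auto const now = Clock::now();
        auto const us = std::chrono::duration_cast< std::chrono::microseconds >(now - last_).count();
        last_ = now;
        tokens_ = std::min(capacity_, tokens_ + rate_ * static_cast< uint64_t >(us) / 1000000u);
    }

    uint64_t rate_;     // sustained bytes/sec granted to gc
    uint64_t capacity_; // burst allowance in bytes
    uint64_t tokens_;
    Clock::time_point last_;
};
```

Foreground IO is unaffected; only the GC path would consult the limiter before issuing each copy.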


// time.
// TODO:: handle the error case here.
LOGERROR("Failed to query blobs in gc index table for ret={} move_to_chunk={}", ret, move_to_chunk);
m_reserved_chunk_queue.blockingWrite(move_to_chunk);
Collaborator:

It is a bit weird to add the chunk back to the queue here, as the chunk was dequeued elsewhere, not in this function.

Collaborator Author:

The concern here is why replacing the index fails.

1. If it fails before changing the pg index table, we can return false and put the move_to_chunk back to the queue safely.

2. If it fails (or crashes) while changing the pg index table, a partial update has probably happened. In this case we cannot put it back in the queue (since some of the pg blob index has been updated and some PBAs point to the move_to_chunk now) and let it be used by other gc tasks.

So I keep it like this to differentiate the safe case from the unsafe case. I could refine this by returning a status code or splitting out a new function, but let's cycle back on this later, after all the functionality works well.
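
The status-code refinement mentioned here could look like the sketch below. The enum, function, and queue type are illustrative stand-ins, not the actual GCManager types.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical sketch of the status-code idea: distinguish a failure before
// the pg index table is touched (safe: move_to_chunk can be recycled) from
// a partial-update failure (unsafe: some PBAs already point into
// move_to_chunk, so other gc tasks must not reuse it).
enum class ReplaceIndexResult : uint8_t {
    SUCCESS,
    SAFE_FAILURE,   // failed before any pg index entry was updated
    UNSAFE_FAILURE, // failed mid-update; move_to_chunk may be referenced
};

// Hypothetical caller-side handling; returns true on success.
bool handle_replace_result(ReplaceIndexResult res, uint32_t move_to_chunk,
                           std::deque< uint32_t >& reserved_chunk_queue) {
    switch (res) {
    case ReplaceIndexResult::SUCCESS:
        return true;
    case ReplaceIndexResult::SAFE_FAILURE:
        // nothing points at move_to_chunk yet; safe to return it to the queue
        reserved_chunk_queue.push_back(move_to_chunk);
        return false;
    case ReplaceIndexResult::UNSAFE_FAILURE:
        // partial update: leave the chunk out of the queue until crash
        // recovery resolves which pg index entries were applied
        return false;
    }
    return false;
}
```

This keeps the re-queue decision next to the failure classification instead of implicit in where the error branch sits.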

Comment on lines +811 to +814
// considering the complexity of gc crash recovery for tombstone_blob_count, we get it directly from index table
// , which is the most accurate.

// TODO::do we need this as durable entity? remove it and got all the from pg index in real time.
Collaborator:

I think those counters are not meant to be 100% accurate; they are more for metrics. I agree that keeping them durable introduces costs, but I also want to confirm how long it would take to go through the index during boot-up.

Collaborator Author:

If memory serves, I had a benchmark for range query and it was fast. I will double-confirm it.

Collaborator:

The test should be: if an SM is filled up with 128K objects, how long does it take to go through the index during startup?

if (!is_emergent || !is_last_shard) continue;
}

// prepare a shard header for this shard in move_to_chunk
Collaborator:

Sorry, I failed to follow this part.

Collaborator Author:

1. The shard header part.

For a sealed shard, the state in its shard header blk is open, and the state in its shard footer blk is sealed. We always generate the shard metablk from the shard's in-memory metadata rather than reading it from disk, since we do not maintain the PBA of the shard header and footer in indexsvc. As a result, for a sealed shard, after we build the shard metablk in memory, the in-memory state is sealed, so we need to change it to open before writing it to disk, since the state in a shard header should always be open.

However, in emergent gc, the last shard of the emergent chunk is open, not sealed, so we can use it directly.

2. This part:

        if (valid_blob_indexes.empty()) {
            LOGDEBUG("empty shard found in move_from_chunk, chunk_id={}, shard_id={}", move_from_chunk, shard_id);
            // TODO::send a delete shard request to raft channel. there is a case that when we are doing gc, the
            // shard becomes empty, need to handle this case

            // we should always write a shard header for the last shard of emergent gc.
            if (!is_emergent || !is_last_shard) continue;
        }

For any empty shard (no valid blobs in the shard), we can skip handling it. But if it is the last shard of the emergent chunk in an emergent gc, we should still write a shard header for it even though it is empty. That covers the case where the last shard in a chunk contains only a shard header and still triggers emergent gc.
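
As a readability aid only (the real code expresses this inline with a `continue`), the rule for empty shards reduces to a single predicate; the function name below is illustrative.

```cpp
#include <cassert>

// Illustrative predicate equivalent to `if (!is_emergent || !is_last_shard) continue;`:
// an empty shard still gets a header written only when it is the last shard
// of the emergent chunk in an emergent gc.
bool should_write_empty_shard_header(bool is_emergent, bool is_last_shard) {
    return is_emergent && is_last_shard;
}
```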

Collaborator:

> for a sealed shard, the state in its shard header blk is open, and the state in its shard footer blk is sealed ... we need to change it to open before we write it to disk since the state in shard header should always be open.

I doubt we need to do that; actually the state in the shard header is not reliable (and maybe we should move the state out of it). As you point out, we generate the header based on the metablk, so BR/incremental resync already sets header::open to false.

Collaborator Author:

Yes, I am also not quite sure when we will use it (for metrics or scrubbing?), so I try to write the shard header as it used to be.

Collaborator:

How about we make the change and see?

Collaborator Author:

What changes do you want to make? I can try:

1. Move the shard state out of the shard header blk?
or
2. When copying data, do not change the state to open in the shard header blk?

For myself, I tend to keep the copied shard header the same as its source, which means the state in the shard header blk should always be open.

Collaborator:

I would propose trying 1, and then 2 comes for free.

Collaborator Author:

Actually, for the shard header blk, we only write it once and never read it (we do not know where it is; we don't record it in the index table), so the content (including shard state) in it does not matter, right?

If we want to move the state out of the shard header blk, we need to adjust the allocated memory size carefully, since we allocate the buffer according to the size of shard_info_superblk, which includes the shard state.

    sisl::io_blob_safe sb_blob(sisl::round_up(sizeof(shard_info_superblk), repl_dev->get_blk_size()), io_align);
    shard_info_superblk* sb = new (sb_blob.bytes()) shard_info_superblk();
    sb->type = DataHeader::data_type_t::SHARD_INFO;
    sb->info = ShardInfo{.id = new_shard_id,
                         .placement_group = pg_owner,
                         .state = ShardInfo::State::OPEN,
                         .lsn = 0,
                         .created_time = create_time,
                         .last_modified_time = create_time,
                         .available_capacity_bytes = size_bytes,
                         .total_capacity_bytes = size_bytes};

What about commenting out this part of the code, so that we do not care about the shard state in the shard header blk during data copy? After data copy, we don't know where the header is and never read it:


        if (!is_last_shard_in_emergent_chunk) {
            // for a sealed shard, the shard state in the header should also be open. currently the written header
            // is the same as the footer except for the shard state, so we lose the original header.
            r_cast< HSHomeObject::shard_info_superblk* >(header_sgs.iovs[0].iov_base)->info.state =
                ShardInfo::State::OPEN;

            // TODO:: get the original header from the move_from_chunk and change the following part if needed.
            /*
            uint64_t created_time;
            uint64_t last_modified_time;
            uint64_t available_capacity_bytes;
            uint64_t total_capacity_bytes;
            */
        }


LOGDEBUG("all valid blobs are copied from move_from_chunk={} to move_to_chunk={}", move_from_chunk, move_to_chunk);

// we need to commit_blk for the move_to_chunk to make sure the last offset of append_blk_allocator is updated.
Collaborator:

If we refactor this function a bit:

  1. sync write the header via async_alloc_write().get().
  2. submit all data blobs in parallel, returning the PBA from async_alloc_write().thenValue().
  3. wait for all data writes to finish, then take the max PBA.
  4. (except for the last shard) sync write the footer.

This function would be more readable, and we would also have the PBA we need to commit.

Collaborator Author:

I made this fully async, and if any op fails we fail the whole copy_shard, so there is no need to check all the failure cases individually. Isn't it simpler to commit to the last blk? I prefer this way.

For the last shard in the emergent chunk, if all the blobs of that shard are written concurrently, we cannot be sure which blk is written last. Do you mean comparing the blks one by one?

Collaborator:

Yes, a for loop serves the purpose.
Making an async function 100+ lines of complexity and then following it with a get() is not that "async". Personally I would prefer to make async only the parts that need it (the data copy) and leave the other parts (header/footer) sync, as I failed to follow the complicated then-value structure.

But anyway, it is your choice.
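
The for loop described here could be sketched as below. `Pba` and the plain `std::future` are stand-ins for whatever async_alloc_write() actually returns in the real code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <future>
#include <vector>

// Sketch: after submitting all blob writes in parallel, wait on every
// future and take the max returned PBA, which gives the offset to commit.
using Pba = uint64_t;

Pba wait_and_get_max_pba(std::vector< std::future< Pba > >& write_futs) {
    Pba max_pba{0};
    for (auto& fut : write_futs) {
        max_pba = std::max(max_pba, fut.get()); // blocks until this write completes
    }
    return max_pba; // caller would commit_blk up to this offset
}
```

Waiting with a loop of get() calls keeps the control flow flat while the writes themselves still run concurrently.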

@JacksonYao287 force-pushed the gc-data-copy branch 2 times, most recently from bc3abf3 to 02f24d3 on June 5, 2025 10:27
@xiaoxichen requested review from Besroy and yuwmao on June 6, 2025 10:26