implement gc data copy #303
base: main
Conversation
Codecov Report
@@ Coverage Diff @@
## main #303 +/- ##
==========================================
- Coverage 63.15% 62.96% -0.19%
==========================================
Files 32 35 +3
Lines 1900 3786 +1886
Branches 204 453 +249
==========================================
+ Hits 1200 2384 +1184
- Misses 600 1150 +550
- Partials 100 252 +152
Force-pushed from 98b208d to c3cc258
@@ -212,7 +222,7 @@ GCManager::pdev_gc_actor::pdev_gc_actor(uint32_t pdev_id, std::shared_ptr< HeapC
     std::shared_ptr< GCBlobIndexTable > index_table, HSHomeObject* homeobject) :
     m_pdev_id{pdev_id},
     m_chunk_selector{chunk_selector},
-    m_reserved_chunk_queue{RESERVED_CHUNK_NUM_PER_PDEV},
+    m_reserved_chunk_queue{HS_BACKEND_DYNAMIC_CONFIG(reserved_chunk_num_per_pdev)},
Is it hot-swappable?
No, it is not supported yet; it is finalized at the first bootstrap.
Then maybe let's put it into metablks? What would happen if this value changed after bootstrap?
Nothing will happen. ATM, all these parameters only take effect at the initial step and never change again after that. We can make it hot-swappable when we really need it.
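For reference, a minimal sketch of the "finalized at first bootstrap" semantics plus the metablk suggestion above; the `std::optional` is only a stand-in for real metablk persistence, and none of these names come from the PR:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical frozen copy of the gc sizing parameters, captured once at first bootstrap.
struct gc_actor_meta {
    uint32_t pdev_id;
    uint32_t reserved_chunk_num; // frozen value of reserved_chunk_num_per_pdev
};

// Stand-in for a persisted metablk; the real code would read/write this through the meta service.
std::optional< gc_actor_meta > g_persisted_meta;

uint32_t resolve_reserved_chunk_num(uint32_t pdev_id, uint32_t configured_value) {
    if (!g_persisted_meta) {
        // First bootstrap: capture the currently configured value and persist it.
        g_persisted_meta = gc_actor_meta{.pdev_id = pdev_id, .reserved_chunk_num = configured_value};
    }
    // Recovery: always use the frozen value, even if the dynamic config changed afterwards.
    return g_persisted_meta->reserved_chunk_num;
}
```

The actor would then be constructed with `resolve_reserved_chunk_num(pdev_id, HS_BACKEND_DYNAMIC_CONFIG(reserved_chunk_num_per_pdev))`, so a later config change can never silently disagree with the value the queue was sized with.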
Force-pushed from 5a8cdd5 to 16c6a02
Force-pushed from 16c6a02 to 718f081
data_sgs.iovs.emplace_back(
    iovec{.iov_base = iomanager.iobuf_alloc(blk_size, total_size), .iov_len = total_size});

futs.emplace_back(std::move(
I am really not sure about the disk behavior of reading a blob and then writing that blob elsewhere... will that generate lots of random IO? Hope we can verify the IO behavior and performance in further testing.
Yes, this will definitely generate some random IO, since they are probably in different HDD sectors. We can limit the impact with a rate limiter, which will constrain the traffic from gc.
If that is the case, it seems to make more sense to fetch a batch of blobs and then write a batch of blobs. Rate-limiting doesn't help an inefficient IO become efficient...
My point here is not to make an inefficient IO become efficient.
I mean that through the rate limiter we can constrain the amount of inefficient IO, so that it only takes a small part of total IO. Let me think about it a bit more; we can circle back on this later.
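A minimal sketch of the rate-limiting idea, assuming a simple token bucket over bytes copied per second; `GcRateLimiter` and its parameters are illustrative, not part of this PR:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>

// Token bucket over "bytes copied per second": gc acquires budget before each blob
// read/write, so gc copy traffic stays a bounded share of total io.
class GcRateLimiter {
    using Clock = std::chrono::steady_clock;

public:
    explicit GcRateLimiter(uint64_t bytes_per_sec) :
            m_bytes_per_sec{bytes_per_sec}, m_tokens{bytes_per_sec}, m_last_refill{Clock::now()} {}

    // Block until `bytes` of budget is available, then consume it.
    void acquire(uint64_t bytes) {
        bytes = std::min(bytes, m_bytes_per_sec); // a single request can never exceed bucket capacity
        std::unique_lock< std::mutex > lk{m_mtx};
        for (;;) {
            refill();
            if (m_tokens >= bytes) {
                m_tokens -= bytes;
                return;
            }
            lk.unlock();
            std::this_thread::sleep_for(std::chrono::milliseconds{10});
            lk.lock();
        }
    }

private:
    void refill() {
        const auto now = Clock::now();
        const auto ms = std::chrono::duration_cast< std::chrono::milliseconds >(now - m_last_refill).count();
        if (ms <= 0) { return; }
        m_tokens = std::min(m_bytes_per_sec, m_tokens + m_bytes_per_sec * static_cast< uint64_t >(ms) / 1000);
        m_last_refill = now;
    }

    uint64_t m_bytes_per_sec;
    uint64_t m_tokens;
    Clock::time_point m_last_refill;
    std::mutex m_mtx;
};
```

A gc task would call `acquire(total_size)` before each copy read/write; this caps gc's share of total IO without making the individual random IOs any cheaper, which is the trade-off discussed above.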
// time.
// TODO:: handle the error case here.
LOGERROR("Failed to query blobs in gc index table for ret={} move_to_chunk={}", ret, move_to_chunk);
m_reserved_chunk_queue.blockingWrite(move_to_chunk);
It is a bit weird to add the chunk back to the queue here, as the chunk was dequeued elsewhere, not in this function.
The concern here is why replacing the index fails.
1. If it fails before changing the pg index table, we can return false and put the move_to_chunk back to the queue safely.
2. If it fails (or crashes) while changing the pg index table, a partial update probably happened. In this case we cannot put it back to the queue (since some of the pg blob index entries have been updated and some pbas point to the move_to_chunk now) and let it be used by other gc tasks.
So I keep it like this to differentiate the safe case and the unsafe case. I can refine this by returning a status code or separating it into a new function, but let's circle back on this later after all the functionality works well.
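A small sketch of the status-code idea, assuming the reserved-chunk queue is a `folly::MPMCQueue` (as the `blockingWrite` call in the diff suggests); `index_replace_result`, `chunk_id_t` and `handle_replace_result` are hypothetical names, not the PR's code:

```cpp
#include <cstdint>

#include <folly/MPMCQueue.h>

using chunk_id_t = uint16_t; // stand-in for the PR's chunk id type

// Hypothetical outcome of replacing the pg blob index entries for a gc task.
enum class index_replace_result : uint8_t {
    success,           // every pg index entry now points into move_to_chunk
    failed_before_any, // nothing was touched; move_to_chunk is still unused
    failed_partially   // some entries already point into move_to_chunk
};

// Caller-side handling: requeue the reserved chunk only when it is provably unused.
void handle_replace_result(index_replace_result res, chunk_id_t move_to_chunk,
                           folly::MPMCQueue< chunk_id_t >& reserved_chunk_queue) {
    switch (res) {
    case index_replace_result::success:
        break; // move_from_chunk gets recycled separately
    case index_replace_result::failed_before_any:
        // Safe: no pba references move_to_chunk yet, hand it back to other gc tasks.
        reserved_chunk_queue.blockingWrite(move_to_chunk);
        break;
    case index_replace_result::failed_partially:
        // Unsafe: some blobs already live in move_to_chunk; keep it out of the queue
        // and let gc crash recovery finish or roll the replacement forward.
        break;
    }
}
```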
// considering the complexity of gc crash recovery for tombstone_blob_count, we get it directly from the
// index table, which is the most accurate.

// TODO:: do we need this as a durable entity? remove it and get all of this from the pg index in real time.
I think those counters are not meant to be 100% accurate; they are more for metrics. I agree that keeping these metrics durable introduces costs, but I also want to confirm how long it will take if we go through the index during boot up.
If memory serves, I had a benchmark for range query and it is fast. I will double-confirm it.
The test should be: with an SM filled up with 128K objects, how long does it take to go through the index during startup?
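A rough sketch of how that startup test could be timed; since the actual index query API isn't shown here, the range-query loop is passed in as a callable that invokes its callback once per index entry:

```cpp
#include <chrono>
#include <cstdint>
#include <iostream>

// Time a full boot-up scan over the pg blob index, as would be needed if
// tombstone_blob_count were rebuilt from the index instead of kept durable.
template < typename ForEachEntryFn >
void benchmark_boot_scan(ForEachEntryFn&& for_each_entry) {
    const auto start = std::chrono::steady_clock::now();
    uint64_t total{0}, tombstones{0};
    for_each_entry([&](bool is_tombstone) {
        ++total;
        if (is_tombstone) { ++tombstones; }
    });
    const auto ms = std::chrono::duration_cast< std::chrono::milliseconds >(
                        std::chrono::steady_clock::now() - start)
                        .count();
    std::cout << "scanned " << total << " index entries (" << tombstones << " tombstones) in " << ms << " ms\n";
}
```

Running this against an SM filled with ~128K objects would answer the startup-cost question directly.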
    if (!is_emergent || !is_last_shard) continue;
}

// prepare a shard header for this shard in move_to_chunk
Sorry, I failed to follow this part.
1. If you mean the shard header part:
For a sealed shard, the state in its shard header blk is open, and the state in its shard footer blk is sealed. Now we always generate the shard metablk from the shard's in-memory metadata, not by reading it from disk, since we do not maintain the pba of the shard header and footer in indexsvc. As a result, for a sealed shard, after we build the shard metablk in memory, its state is sealed, so we need to change it to open before writing it to disk, since the state in a shard header should always be open.
However, in emergent gc, the last shard of the emergent chunk is open, not sealed, so we can use it directly.
2. If you mean this part:
if (valid_blob_indexes.empty()) {
LOGDEBUG("empty shard found in move_from_chunk, chunk_id={}, shard_id={}", move_from_chunk, shard_id);
// TODO::send a delete shard request to raft channel. there is a case that when we are doing gc, the
// shard becomes empty, need to handle this case
// we should always write a shard header for the last shard of emergent gc.
if (!is_emergent || !is_last_shard) continue;
}
For an empty shard (no valid blobs in this shard), we can skip handling it. But if it is the last shard of the emergent chunk in emergent gc, we should write a shard header for it even if it is empty. This covers the case where the last shard in a chunk has only a shard header and triggers emergent gc.
> For a sealed shard, the state in its shard header blk is open, and the state in its shard footer blk is sealed. Now we always generate the shard metablk from the shard's in-memory metadata, not by reading it from disk, since we do not maintain the pba of the shard header and footer in indexsvc. As a result, for a sealed shard, after we build the shard metablk in memory, its state is sealed, so we need to change it to open before writing it to disk, since the state in a shard header should always be open.
I doubt we need to do that; actually, the state in the shard header is not reliable (and maybe we should move the state out of it). As you point out, we generate the header based on the metablk, so BR/incremental resync already sets header::open to false.
Yes, I am also not quite sure when we will use it (for metrics or scrubbing?), so I try to write the shard header as it used to be.
How about we make the change and see?
What changes do you want to make? I can try:
1. move the shard state out of the shard header blk, or
2. when copying data, do not change the state to open in the shard header blk.
For myself, I tend to keep the copied shard header the same as its source, which means the state in the shard header blk should always be open.
I would propose to try 1, and then 2 comes for free.
Actually, for the shard header blk, we only write it once and never read it (we do not know where it is; we don't record it in the index table), so the content in it (including the shard state) does not matter, right?
If we want to move the state out of the shard header blk, we need to adjust the allocated memory size carefully, since we allocate the buffer size according to the size of shard_info_superblk, which includes the shard state.
sisl::io_blob_safe sb_blob(sisl::round_up(sizeof(shard_info_superblk), repl_dev->get_blk_size()), io_align);
shard_info_superblk* sb = new (sb_blob.bytes()) shard_info_superblk();
sb->type = DataHeader::data_type_t::SHARD_INFO;
sb->info = ShardInfo{.id = new_shard_id,
.placement_group = pg_owner,
.state = ShardInfo::State::OPEN,
.lsn = 0,
.created_time = create_time,
.last_modified_time = create_time,
.available_capacity_bytes = size_bytes,
.total_capacity_bytes = size_bytes};
What if I comment out this part of the code, so that we do not care about the shard state in the shard header blk during data copy? After data copy, we don't know where it is and never read it:
if (!is_last_shard_in_emergent_chunk) {
// for the sealed shard, the shard state in header should also be open.now, the written header is the same
// as footer except the shard state, so we lost the original header.
r_cast< HSHomeObject::shard_info_superblk* >(header_sgs.iovs[0].iov_base)->info.state =
ShardInfo::State::OPEN;
// TODO:: get the original header from the move_from_chunk and change the following part if needed.
/*
uint64_t created_time;
uint64_t last_modified_time;
uint64_t available_capacity_bytes;
uint64_t total_capacity_bytes;
*/
}
LOGDEBUG("all valid blobs are copied from move_from_chunk={} to move_to_chunk={}", move_from_chunk, move_to_chunk);

// we need to commit_blk for the move_to_chunk to make sure the last offset of append_blk_allocator is updated.
If we refactor this function a bit:
- sync write the header via async_alloc_write().get(),
- submit all data blobs in parallel, returning the PBA from async_alloc_write().thenValue(),
- wait for all data writes to finish and get the max PBA,
- (except for the last shard) sync write the footer.
This function would be more readable, and we can get the PBA we need to commit.
I made this fully async, and if any op fails, we fail the whole copy_shard, so there is no need to check all kinds of failure cases.
Is it simpler to commit the last blk? I prefer this way.
For the last shard in an emergent chunk, if all the blobs of this shard are written concurrently, we cannot be sure which blk is written last. Do you mean comparing the blks one by one?
Yes, a for loop serves the purpose.
Making an async function complicated to 100+ lines and then following it with a get() is not that "async". Personally I would prefer to async only the part that needs it (data copy) and leave the other parts (header/footer) sync, as I failed to follow the complicated then-value structure.
But anyway, it is your choice.
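A sketch of the proposed flow (sync header, parallel data writes, max PBA, sync footer), under the assumption that the header/footer/blob writers are thin wrappers over the PR's `async_alloc_write` calls; all names and signatures here are illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

#include <folly/futures/Future.h>

using pba_t = uint64_t; // stand-in for the real on-disk block-id type

// Copy one shard: sync header, all blob writes in parallel, wait and take the largest
// written pba (the value commit_blk needs), then sync footer.
//   WriteHeaderFn : () -> bool                          (sync header write)
//   WriteBlobFn   : (blob_id) -> folly::Future< pba_t > (async data write, resolves to its pba)
//   WriteFooterFn : () -> bool                          (sync footer write)
template < typename WriteHeaderFn, typename WriteBlobFn, typename WriteFooterFn >
std::optional< pba_t > copy_shard(const std::vector< uint64_t >& valid_blob_ids,
                                  bool is_last_shard_in_emergent_chunk, WriteHeaderFn&& write_header_sync,
                                  WriteBlobFn&& write_blob_async, WriteFooterFn&& write_footer_sync) {
    if (!write_header_sync()) { return std::nullopt; }

    std::vector< folly::Future< pba_t > > futs;
    futs.reserve(valid_blob_ids.size());
    for (const auto blob_id : valid_blob_ids) {
        futs.emplace_back(write_blob_async(blob_id));
    }

    // Wait for every data write and remember the max pba; a plain for loop is enough here.
    pba_t max_pba{0};
    for (auto& t : folly::collectAll(futs).get()) {
        if (t.hasException()) { return std::nullopt; } // any failure fails the whole shard copy
        max_pba = std::max(max_pba, t.value());
    }

    // The still-open last shard of an emergent chunk gets no footer.
    if (!is_last_shard_in_emergent_chunk && !write_footer_sync()) { return std::nullopt; }
    return max_pba; // caller commits up to this pba on move_to_chunk
}
```

Only the data copies run concurrently; header and footer stay synchronous, which matches the readability concern raised above.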
Force-pushed from bc3abf3 to 02f24d3
Crash recovery is also included in this PR.