Don't zero-after-free DataStream: Faster IBD on some configurations #30987
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Don't zero-after-free DataStream
: Faster IBD on some configurations
#30987
Conversation
🚧 At least one of the CI tasks failed.
Force-pushed from d90eb00 to 5cf2fef
Appendix
The CI failure on ARM is related, and I am able to reproduce it locally.
I tested this PR on an AMD Ryzen 7950x machine with Ubuntu 24.04, with one local network peer and a gigabit internet connection.

Before: 5 hours 10 minutes. Time includes 20 minutes to flush chainstate to disk during shutdown. That's nowhere near a 25% difference and probably not statistically significant. Some configurations might do better than others from this change; it's certainly not worse. The node was additionally patched to drop the
Title changed from "DataStream: ~25% faster IBD" to "DataStream: ~25% faster IBD on some configurations"
@davidgumberg Thanks for the very detailed description! I suspect the speedup here is going to be very dependent on the architecture/environment, but it's not clear to me exactly what variables would matter most. Would you be interested in working up a specific zero-after-free benchmark that illustrates the difference? If we knew exactly where the speedup was coming from, we could make a more informed decision about what/where to optimize.
@Sjors Thanks for testing and showing that the benefit (if any) here is setup-dependent. I suspect the biggest improvements will be seen on memory-bandwidth-constrained systems and when flushing/syncing the coins db to disk, most of which you have skipped by running with such a high dbcache and unpruned; so I think your setup is a worst-case scenario for improvements from this change, but that could be my own wishful thinking!

@theuni Good point, especially given Sjors' result; I will draft up a zero-after-free benchmark to isolate the causes and relevant factors of any performance benefit that might be here.
Title changed from "DataStream: ~25% faster IBD on some configurations" to "DataStream: Faster IBD on some configurations"
Force-pushed from 10a5836 to d12a149
Force-pushed from d12a149 to 1e6cb11
Force-pushed from d146017 to 3c5d8d4
Force-pushed from 3c5d8d4 to 0a3bae7
The removed passage, introduced with this benchmark in bitcoin#16267, appears to have been copied from the earlier block tests in `bench/checkblock.cpp` (bitcoin#9049). There, it is relevant to prevent triggering what seems to be a vestigial branch of DataStream::Rewind() related to the unused DataStream::Compact(). While harmless, it is removed because it can trigger a spurious bounds warning on GCC <12.3 and <11.4. This issue was previously worked around in c78d8ff (PR #30765). GCC Bugzilla issue: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100366
Avoid using BOOST_CHECK_EQUAL_COLLECTIONS for two std::vector<std::pair<SerializeData, SerializeData>>, since Boost expects printing methods that are very brittle. See:
* https://www.boost.org/doc/libs/1_86_0/libs/test/doc/html/boost_test/test_output/test_tools_support_for_logging/testing_tool_output_disable.html
* https://stackoverflow.com/questions/10976130/boost-check-equal-with-pairint-int-and-custom-operator
* https://stackoverflow.com/questions/3999644/how-to-compare-vectors-with-boost-test
* boostorg/type_traits#196
`SerializeData` is used by `DataStream`, which is used throughout the codebase for non-secret data. The zeroing allocator was originally introduced as a mitigation for buffer overflows, but the potential mitigation is not worth the performance cost: it slows down IBD by as much as ~25%.
Force-pushed from 0a3bae7 to 906e67b
Fixed the spurious array bounds warning that occurs on Debian because it uses GCC 12.2, which has a bug that triggers spurious array-bounds warnings (GCC Bugzilla #100366).

As suggested by @theuni, I've added some benchmarks that help show where the performance improvement is coming from. In a test where a 1000-byte

I also made a branch (davidgumberg@c832fed) with a version of the zero-after-free allocator that keeps the compiler-optimization prevention but doesn't actually memset the stream to zero; its performance in some cases is only slightly better than master. For example, in the same test as above it managed ~4.72 GB/s on the 7900x, while on the Raspberry Pi this "partial zero-after-free" branch was closer to my no-zeroing branch, getting ~6.95 GB/s. This seems to hint that a large part of the performance issue here isn't just from zeroing memory with `memset`.

I ran the benchmarks on three devices, and the data is below. The most curious result is from the Ryzen 7640U w/ 5600 MT/s memory, which showed the least improvement between master, "partial zero", and my branch. Repeated runs were noisy; I used `-min-time=60000`.
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
20.67 | 48,369,994.69 | 0.6% | 61.83 | 47.04 | 1.314 | 8.87 | 0.8% | 68.71 | CCoinsViewDBFlush |
0.86 | 1,165,414,983.28 | 0.0% | 5.14 | 2.06 | 2.498 | 1.04 | 0.0% | 66.01 | DataStreamAlloc |
0.18 | 5,416,728,210.85 | 0.1% | 1.26 | 0.44 | 2.839 | 0.25 | 0.0% | 66.00 | DataStreamSerializeScript |
9.06 | 110,322,628.58 | 0.1% | 32.40 | 21.69 | 1.493 | 6.06 | 0.7% | 66.09 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
4,319,764.64 | 231.49 | 0.2% | 21,518,823.54 | 10,347,746.41 | 2.080 | 3,965,023.40 | 1.1% | 64.06 | DeserializeAndCheckBlockTest |
2,983,304.65 | 335.20 | 0.1% | 14,726,319.41 | 7,146,940.53 | 2.061 | 2,622,747.10 | 0.7% | 66.04 | DeserializeBlockTest |
Modified zero-after-free allocator that prevents memory optimization but doesn't zero memory.
~/bitcoin $ git checkout --detach $partzero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at 0351c4242a Modify zero after free allocator to prevent optimizations without zeroing memory
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
20.64 | 48,461,301.71 | 0.5% | 61.84 | 46.60 | 1.327 | 8.87 | 0.8% | 68.28 | CCoinsViewDBFlush |
0.84 | 1,183,775,230.65 | 0.0% | 5.08 | 2.03 | 2.505 | 1.02 | 0.0% | 66.02 | DataStreamAlloc |
0.14 | 6,951,563,016.33 | 0.0% | 1.13 | 0.35 | 3.273 | 0.21 | 0.0% | 66.00 | DataStreamSerializeScript |
9.45 | 105,798,798.06 | 0.3% | 46.75 | 22.67 | 2.062 | 8.46 | 0.5% | 66.14 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
4,172,021.00 | 239.69 | 0.1% | 21,543,066.84 | 9,993,817.02 | 2.156 | 3,988,350.18 | 1.0% | 63.92 | DeserializeAndCheckBlockTest |
2,919,977.25 | 342.47 | 0.0% | 14,750,310.48 | 6,994,754.12 | 2.109 | 2,646,087.06 | 0.5% | 66.07 | DeserializeBlockTest |
My PR branch with no zero-after-free allocator:
~/bitcoin $ git checkout --detach $nozero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at 906e67b951 refactor: Drop unused `zero_after_free_allocator`
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
20.89 | 47,868,766.30 | 0.7% | 60.74 | 47.24 | 1.286 | 9.12 | 0.9% | 69.52 | CCoinsViewDBFlush |
0.04 | 27,639,502,423.73 | 0.0% | 0.20 | 0.09 | 2.312 | 0.04 | 0.0% | 66.02 | DataStreamAlloc |
0.14 | 7,030,720,015.31 | 0.0% | 1.09 | 0.34 | 3.203 | 0.22 | 0.0% | 66.03 | DataStreamSerializeScript |
8.46 | 118,171,923.30 | 0.1% | 29.40 | 20.25 | 1.452 | 5.06 | 0.8% | 66.06 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
4,111,234.73 | 243.24 | 0.1% | 21,519,664.21 | 9,847,208.26 | 2.185 | 3,965,210.98 | 1.0% | 63.80 | DeserializeAndCheckBlockTest |
2,857,220.97 | 349.99 | 0.1% | 14,727,090.03 | 6,843,201.05 | 2.152 | 2,622,831.00 | 0.5% | 65.95 | DeserializeBlockTest |
Ryzen 7900x 5200 MT/s DDR5
Original zero-after-free allocator still in use with DataStream
~/bitcoin$ git checkout --detach $yeszero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at 772b1f606f test: avoid BOOST_CHECK_EQUAL for complex types
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
6.14 | 162,782,032.39 | 0.7% | 56.96 | 27.77 | 2.051 | 8.25 | 0.7% | 61.54 | CCoinsViewDBFlush |
0.19 | 5,280,744,677.81 | 0.1% | 5.10 | 0.89 | 5.755 | 1.02 | 0.0% | 65.93 | DataStreamAlloc |
0.22 | 4,577,202,378.38 | 0.5% | 5.70 | 1.02 | 5.579 | 1.16 | 0.1% | 66.27 | DataStreamSerializeScript |
2.37 | 422,778,468.05 | 0.2% | 32.39 | 11.06 | 2.929 | 5.12 | 0.6% | 66.04 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1,319,284.06 | 757.99 | 0.4% | 20,617,084.61 | 6,164,538.66 | 3.344 | 3,706,003.42 | 0.7% | 65.82 | DeserializeAndCheckBlockTest |
879,982.73 | 1,136.39 | 0.4% | 14,213,986.82 | 4,113,201.90 | 3.456 | 2,432,431.24 | 0.2% | 65.87 | DeserializeBlockTest |
Modified zero-after-free allocator that prevents memory optimization but doesn't zero memory.
~/btc/bitcoin$ git checkout --detach $partzero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at 3bdd43680e Modify zero after free allocator to prevent optimizations without zeroing memory
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
6.24 | 160,226,428.51 | 0.5% | 56.96 | 27.99 | 2.035 | 8.25 | 0.7% | 62.34 | CCoinsViewDBFlush |
0.18 | 5,415,824,062.30 | 0.1% | 5.07 | 0.86 | 5.869 | 1.02 | 0.0% | 65.99 | DataStreamAlloc |
0.21 | 4,715,585,681.78 | 0.1% | 5.62 | 0.99 | 5.664 | 1.14 | 0.1% | 65.93 | DataStreamSerializeScript |
2.36 | 424,307,427.06 | 0.1% | 32.36 | 11.02 | 2.938 | 5.12 | 0.6% | 66.07 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1,304,195.07 | 766.76 | 0.1% | 20,615,353.83 | 6,096,229.68 | 3.382 | 3,705,797.43 | 0.7% | 66.01 | DeserializeAndCheckBlockTest |
876,218.51 | 1,141.27 | 0.0% | 14,212,309.42 | 4,095,993.88 | 3.470 | 2,431,660.20 | 0.2% | 65.98 | DeserializeBlockTest |
My PR branch with no zero-after-free allocator:
~/btc/bitcoin$ git checkout --detach $nozero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at 906e67b951 refactor: Drop unused `zero_after_free_allocator`
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
6.24 | 160,367,026.40 | 0.8% | 57.70 | 28.03 | 2.059 | 8.46 | 0.7% | 62.47 | CCoinsViewDBFlush |
0.01 | 113,328,653,394.82 | 0.0% | 0.12 | 0.04 | 2.854 | 0.02 | 0.0% | 65.69 | DataStreamAlloc |
0.04 | 23,329,286,239.78 | 0.0% | 0.89 | 0.20 | 4.454 | 0.19 | 0.0% | 64.00 | DataStreamSerializeScript |
2.26 | 441,734,425.78 | 0.1% | 29.88 | 10.58 | 2.825 | 4.62 | 0.6% | 65.89 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1,302,825.68 | 767.56 | 0.2% | 20,617,190.29 | 6,090,178.32 | 3.385 | 3,706,032.36 | 0.7% | 65.93 | DeserializeAndCheckBlockTest |
874,097.45 | 1,144.04 | 0.1% | 14,212,631.31 | 4,085,149.78 | 3.479 | 2,431,804.86 | 0.2% | 66.24 | DeserializeBlockTest |
Ryzen 5 7640U 5600 MT/s DDR5
This run was done with a slightly updated version of CCoinsViewDBFlush from the above runs.
Original zero-after-free allocator still in use with DataStream
~/bitcoin$ git checkout --detach $yeszero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at 970e7822d4 test: avoid BOOST_CHECK_EQUAL for complex types
ns/coin | coin/s | err% | ins/coin | cyc/coin | IPC | bra/coin | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1,537.95 | 650,216.26 | 1.0% | 8,199.37 | 5,151.04 | 1.592 | 1,377.36 | 0.8% | 67.93 | CCoinsViewDBFlush |
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
0.02 | 44,829,302,792.99 | 0.2% | 0.16 | 0.08 | 2.009 | 0.03 | 0.0% | 65.87 | DataStreamAlloc |
0.08 | 12,714,010,775.32 | 0.1% | 1.25 | 0.27 | 4.579 | 0.24 | 0.0% | 66.38 | DataStreamSerializeScript |
3.74 | 267,485,270.13 | 0.6% | 31.97 | 12.92 | 2.474 | 5.01 | 0.6% | 63.25 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
2,157,061.67 | 463.59 | 0.8% | 21,937,911.43 | 7,472,463.20 | 2.936 | 3,976,431.11 | 0.7% | 66.21 | DeserializeAndCheckBlockTest |
1,523,202.16 | 656.51 | 0.5% | 16,402,554.71 | 5,276,345.85 | 3.109 | 2,930,545.76 | 0.2% | 66.23 | DeserializeBlockTest |
Modified zero-after-free allocator that prevents memory optimization but doesn't zero memory.
~/bitcoin$ git checkout --detach $partzero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at c832feda63 Modify zero after free allocator to prevent optimizations without zeroing memory
ns/coin | coin/s | err% | ins/coin | cyc/coin | IPC | bra/coin | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1,558.58 | 641,609.73 | 0.4% | 8,200.70 | 5,210.43 | 1.574 | 1,377.66 | 0.8% | 69.12 | CCoinsViewDBFlush |
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
0.01 | 71,945,612,656.56 | 0.1% | 0.12 | 0.05 | 2.567 | 0.02 | 0.0% | 65.35 | DataStreamAlloc |
0.06 | 17,044,987,379.75 | 0.3% | 1.16 | 0.20 | 5.685 | 0.21 | 0.0% | 66.07 | DataStreamSerializeScript |
3.67 | 272,659,024.97 | 0.3% | 31.94 | 12.71 | 2.514 | 5.01 | 0.6% | 63.38 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
2,131,937.90 | 469.06 | 0.2% | 21,935,850.89 | 7,404,122.77 | 2.963 | 3,975,866.20 | 0.7% | 66.04 | DeserializeAndCheckBlockTest |
1,516,657.38 | 659.34 | 0.3% | 16,397,963.21 | 5,259,062.71 | 3.118 | 2,929,264.57 | 0.2% | 66.02 | DeserializeBlockTest |
My PR branch with no zero-after-free allocator:
~/bitcoin$ git checkout --detach $nozero && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release &>/dev/null && cmake --build build -j $(nproc) &>/dev/null && ./build/src/bench/bench_bitcoin -filter="(DataStream.*|CCoinsViewDB.*|ProcessMessage.*|Deserial.*)" -min-time=60000
HEAD is now at b5fee2fd09 refactor: Drop unused `zero_after_free_allocator`
ns/coin | coin/s | err% | ins/coin | cyc/coin | IPC | bra/coin | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1,504.51 | 664,666.35 | 0.9% | 7,902.94 | 5,023.82 | 1.573 | 1,342.60 | 0.8% | 66.38 | CCoinsViewDBFlush |
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
0.01 | 75,642,383,126.35 | 0.3% | 0.12 | 0.05 | 2.695 | 0.02 | 0.0% | 65.00 | DataStreamAlloc |
0.05 | 18,357,256,774.43 | 0.2% | 0.91 | 0.19 | 4.800 | 0.19 | 0.0% | 66.02 | DataStreamSerializeScript |
3.65 | 273,613,395.28 | 0.0% | 31.94 | 12.70 | 2.515 | 5.01 | 0.6% | 63.30 | ProcessMessageBlock |
ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
2,127,133.96 | 470.12 | 0.4% | 21,940,668.03 | 7,392,562.32 | 2.968 | 3,976,863.37 | 0.7% | 66.20 | DeserializeAndCheckBlockTest |
1,512,064.24 | 661.35 | 0.4% | 16,399,530.70 | 5,255,082.56 | 3.121 | 2,929,551.27 | 0.2% | 65.85 | DeserializeBlockTest |
Footnotes
1. I'm not very knowledgeable about memory performance, but I suspect the Ryzen 7640U device's memory is faster for reasons beyond the "max bandwidth" (5200 MT/s for the 7900x, 5600 MT/s for the 7640U). I don't have a reference for this, but I believe the 7640U has a one-generation-newer memory controller than the Ryzen 7900x, and anecdotally I see better performance doing LLM inference on the CPU of the 7640U, which is a memory-bandwidth-bound workload. ↩
This is an interesting (and expected, I suppose) takeaway. Sadly, it suggests that there's really nothing we can do to optimize our implementation. I looked around at clang/gcc/glibc/llvm-libc to see if there are any other ways of handling this, but they all resort to the same memory barrier trick. This is definitely interesting enough to consider disabling selectively, though I'm not convinced we should just nuke it everywhere. I know you've already collected a good bit of data on this, but it's still not clear to me exactly why this is speeding up IBD. It could be, for example, that the net messages account for 90% of the memory_cleanse() calls. Would you be up for creating a callgraph/flamegraph which shows the hierarchy for these calls?
I have compared the full IBD speed of
The added test served as a baseline; dropping the zero-after-free allocator is the purpose of this PR (I know you've added new commits since, let me know if you think that changes the landscape). I've used a low but reasonable 2GB dbcache for the first 800k blocks to measure the underlying database instead of a single final dump with real nodes (which is surprisingly stable, given enough blocks). I ran it on a Hetzner HDD; the results are unfortunately not very promising.

benchmark:

hyperfine --runs 1 \
  --export-json /mnt/ibd_full-zero_after_free_allocator_change.json \
  --parameter-list COMMIT 0449a22bc0bcd610a898ec921af30175e2b34757,5cf2fefd33907c48f79e444032d79f7c889345d8 \
  --prepare 'git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc) && rm -rf /mnt/BitcoinData/*' \
  'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbcache=2000 -printtoconsole=0'

Benchmark 1: COMMIT=0449a22bc0bcd610a898ec921af30175e2b34757 ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbcache=2000 -printtoconsole=0
Time (abs ≡): 33806.099 s [User: 27353.365 s, System: 4255.723 s]
Benchmark 2: COMMIT=5cf2fefd33907c48f79e444032d79f7c889345d8 ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbcache=2000 -printtoconsole=0
Time (abs ≡): 33978.406 s [User: 27050.874 s, System: 4283.780 s]
Summary
COMMIT=0449a22bc0bcd610a898ec921af30175e2b34757 ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbcache=2000 -printtoconsole=0 ran
1.01 times faster than COMMIT=5cf2fefd33907c48f79e444032d79f7c889345d8 ./build/src/bitcoind -datadir=/mnt/BitcoinData -stopatheight=800000 -dbcache=2000 -printtoconsole=0

So basically this change didn't seem to affect IBD speed at all.
I reran it on the same platform until 840000 with 1GB dbcache with the latest commits.

benchmark:

hyperfine \
  --runs 1 \
  --export-json /mnt/my_storage/ibd_full-zero_after_free_allocator_change.json \
  --parameter-list COMMIT 1e096b30da808d6b0691f5cde14c529e1c85ff1b,906e67b95157fd557438c37b3085cf5dec2ae135 \
  --prepare 'git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc) && rm -rf /mnt/my_storage/BitcoinData/*' \
  'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=840000 -dbcache=1000 -printtoconsole=0'

I've compared these two:
results:

Benchmark 1: COMMIT=1e096b30da808d6b0691f5cde14c529e1c85ff1b ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=840000 -dbcache=1000 -printtoconsole=0
Time (abs ≡): 41489.685 s [User: 35294.609 s, System: 6578.215 s]
Benchmark 2: COMMIT=906e67b95157fd557438c37b3085cf5dec2ae135 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=840000 -dbcache=1000 -printtoconsole=0
Time (abs ≡): 42348.064 s [User: 35077.894 s, System: 7019.189 s]
Summary
COMMIT=1e096b30da808d6b0691f5cde14c529e1c85ff1b ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=840000 -dbcache=1000 -printtoconsole=0 ran
1.02 times faster than COMMIT=906e67b95157fd557438c37b3085cf5dec2ae135 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=840000 -dbcache=1000 -printtoconsole=0

Which seems to indicate that there wasn't any speedup after this change.
Concept ACK. Using zero-after-free allocators for all CDataStream usage is overkill. I don't think there are any, but if there are places where it is used (wallet) where security against key leaks is important, we could parametrize it.

I expect this can make a significant difference on systems with relatively slow CPU or memory, like ARM systems (will run some benchmarks).
Exactly. We have

Edit: Benchmarked on an Rpi5 with an NVMe hat, synchronizing from a node directly connected over a 1Gbit network:
>>> (datetime.datetime.fromisoformat('2024-10-16T18:56:56') - datetime.datetime.fromisoformat('2024-10-16T08:14:13')).seconds
38563
>>> (datetime.datetime.fromisoformat('2024-10-15T21:16:02') - datetime.datetime.fromisoformat('2024-10-15T10:39:43')).seconds
38179

It is faster, but only by roughly 1%.
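One caveat about the snippet above: `timedelta.seconds` only reports the sub-day remainder, so it happens to be correct here only because both syncs took under 24 hours. `total_seconds()` is the safe choice (the helper name below is illustrative):

```python
import datetime

def sync_seconds(start_iso: str, end_iso: str) -> int:
    """Duration between two ISO timestamps, in whole seconds.

    Uses total_seconds() so durations longer than a day are not silently
    truncated (timedelta.seconds drops whole days).
    """
    start = datetime.datetime.fromisoformat(start_iso)
    end = datetime.datetime.fromisoformat(end_iso)
    return int((end - start).total_seconds())

print(sync_seconds('2024-10-16T08:14:13', '2024-10-16T18:56:56'))  # 38563
print(sync_seconds('2024-10-15T10:39:43', '2024-10-15T21:16:02'))  # 38179
```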
Given that others have been unable to reproduce the results that led me to open this PR, I am moving it to draft until I better understand either what went wrong in my measurements or which setups reproduce the result I got.
🚧 At least one of the CI tasks failed.
I have noticed that in debug builds the zeroing is a big part of the IBD flame graphs, but in release mode it's almost completely eliminated.
While the performance increase likely isn't measurable, it could still make sense to remove the memory zeroing, at least for most serializations: if for no other reason, it's doing useless work, and the behavior is easier to understand without it.
🐙 This pull request conflicts with the target branch and needs rebase.
⌛ There hasn't been much activity lately and the patch still needs rebase. What is the status here?
This PR modifies `DataStream`'s byte vector `vch` to use the default allocator, `std::allocator`, rather than the `zero_after_free_allocator`, which degrades performance greatly. The `zero_after_free_allocator` is identical to the default `std::allocator` except that it zeroes memory using `memory_cleanse()` before deallocating.

This PR also drops the `zero_after_free_allocator`, since it was only used by `DataStream` and `SerializeData`.

In my testing (n=2) on a Raspberry Pi 5 with 4GB of memory, syncing from a fast connection to a stable dedicated node, my branch takes ~74% of the time taken by master¹ to sync to height 815,000; average wall clock time was 35h 58m 40s on this branch and 48h 17m 15s on master. (See the benchmarking appendix.)
I expect most of the performance improvement to come from the use of `DataStream` for all `CDBWrapper` keys and values, and for all P2P messages. I suspect there are other use cases where performance is improved, but I have only tested IBD.

Objects that contain secrets should not be allocated using `zero_after_free_allocator`, since they are liable to get mapped to swap space and written to disk if the user is running low on memory, and I intuit this is a likelier path for an attacker to find cryptographic secrets than scanning unzeroed memory. Secrets should be allocated using `secure_allocator`, which cleanses on deallocation and `mlock()`s the memory reserved for secrets to prevent it from being mapped to swap space.

Are any secrets stored in `DataStream` that will lose security?

I have reviewed every appearance of `DataStream` and `SerializeData` as of 39219fe and have made notes in the appendix below that provide context for each instance where either is used. The only use case I wasn't certain of is PSBTs; I believe these are never secrets, but I am not certain whether there are use cases where PSBTs are worthy of being treated as secrets, where being vigilant about not writing them to disk would be wise.
As I understand it, most of the use of `DataStream` in the wallet code is for reading and writing "crypted" key and value data, which gets decrypted somewhere else, in a `ScriptPubKeyMan` far away from any `DataStream` container; but I could also be wrong about this, or have misunderstood its use elsewhere in the wallet.

Zero-after-free as a buffer overflow mitigation
The `zero_after_free` allocator was added as a buffer overflow mitigation, the idea being that `DataStream`s store a lot of unsecured data that we don't control, like the UTXO set and all P2P messages, and an attacker could fill memory in a predictable way to escalate a buffer overflow into an RCE. (See the Historical Background appendix.)

I agree completely with practicing security in depth, but I don't think this mitigation is worth the performance hit because:

I'm not a security expert, and I had a hard time finding any writing that discusses this particular mitigation strategy of zeroing memory, so I hope someone with more knowledge of memory vulnerabilities can assist.
Other notes

- I've kept `SerializeData` as `std::vector<std::byte>` instead of deleting it and refactoring the spots where it's used in the wallet, to keep the PR small; if others think it would be better to delete it, I would be happy to do it.
- It may not only be the `memory_cleanse` that is causing the performance issue: the trick we do to prevent compilers from optimizing out the `memset` call may also be preventing other optimizations on the `DataStream`s, but I have yet to test this.
- `SerializeData` once it loses its custom allocator.

Footnotes
1. Master at the time of my testing was 6d546336e800. ↩