Skip to content

[SIMD-0321]: VM Register 2 Instruction Data Pointer #321

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

buffalojoec
Copy link
Contributor

Provide a pointer to instruction data in VM register 2 (r2) at program
entrypoint, enabling direct access to instruction data without parsing the
serialized input region.

@buffalojoec buffalojoec changed the title [SIMD-0320]: VM Register 2 Instruction Data Pointer [SIMD-0321]: VM Register 2 Instruction Data Pointer Jul 11, 2025
@buffalojoec buffalojoec force-pushed the r2-instruction-data-ptr branch from 61ac6c6 to a6a212a Compare July 11, 2025 00:58
@buffalojoec
Copy link
Contributor Author

Thanks for the review @deanmlittle! I've added a commit with your suggestions, adapted in some places.

Comment on lines +66 to +67
* The pointer in `r2` points to the first byte of the actual instruction data,
NOT the length field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not point it to the length?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it should be pointing to the length field and that could then be offset to get the actual data pointer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a bad idea that complicates everything downstream from it. The major benefit of the r2 upgrade is the ability to statically derive instruction data from fixed offsets. Length is typically only accessed once and can be derived from an offset of -8. Conversely, having to add 8 all of your static offsets or burn an additional register/cu to store another pointer makes no sense.

Copy link
Contributor Author

@buffalojoec buffalojoec Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, originally I had length as well, but a few people in the discussion requested we point to the data. There are also areas where we subtract 8 to obtain a length in the SDK/tooling already, so this pattern isn't too unusual.

https://github.com/anza-xyz/solana-sdk/blob/df57ce256ba2b427c9e480d7e5e5993a2b7414bf/account-info/src/lib.rs#L170

Ultimately, it makes no difference to me. I think it should be whatever the people who will use it think is best.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conversely, having to add 8 all of your static offsets or burn an additional register/cu to store another pointer makes no sense.

I mentioned this on the discussion @buffalojoec pointed. I'm not confident the gains will be noticeable. r2 stores function call arguments, so you'll lose the reference to the pointer as soon as you call a function. In that case, it won't make a difference if you have r2 pointing directly to the data or you've already offset it and stored it elsewhere. Either you'll have a spill or you'll use another register.

If you design an entry point without any function call, you may see a difference. But again, if this entry point has enough operations, you'll need to spill the value to the stack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my proof-of-concept I demonstrated how a program can simply define its entrypoint to receive the r2 parameter, which would turn it into a stack variable, no?

#[no_mangle]
pub unsafe extern "C" fn entrypoint(
    input: *mut u8,
    instruction_data_offset: u64,
) -> u64 {
    let _ = instruction_data_offset;
    /* ... */
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily. It will only be stored in the stack when needed. In a very simple function that needs no stack space, it will stay in a register. These cases aren't so common, though.

I saw your example mentions instruction_data_offset. It is supposed to be the pointer, right? not the offset from the starting pointer.


1. **Provide a pointer to instruction data length**: Store a pointer to the
instruction data length field in `r2`. However, providing a direct pointer to
the start of instruction data is more ergonomic.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

providing a direct pointer to the start of instruction data is more ergonomic.

Why is that? If I were to deserialize, I would expect to see the length first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider reading up on how our existing Rust SDKs work: https://github.com/anza-xyz/solana-sdk/blob/master/account-info/src/lib.rs#L170

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the point for discussion is whether it is wrong or right to subtract the pointer. The account info point you just mentioned needs the length first: https://github.com/anza-xyz/solana-sdk/blob/d708b8a829743013accc5f537709b5dd7b7fbc91/program-entrypoint/src/lib.rs#L429-L438.

Since in most cases, you'll need to validate the data pointer to ensure you are not reading garbage or doing an out of bounds read, you'll fetch the length first.

I believe the design should incentivize the correct management of pointers and memory regions, so my suggestion to keep pointing to the length first is based on that.

Comment on lines +107 to +109
Programs should read and validate the instruction data length (stored at `r2 - 8`)
before accessing data via the `r2` pointer. Failing to check the length could
result in reading unintended memory contents or out-of-bounds access attempts.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly because of this, I would expect to read the length first. So pointing to the length would be more straightforward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point! I can go either way. My original PoC was pointing to length, until a few devs requested the actual data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for a pointer to the length.

Comment on lines +123 to +124
This feature is NOT backwards compatible for any programs that depend on the
uninitialized/garbage data previously in `r2`.
Copy link

@topointon-jump topointon-jump Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of this, this change should be feature-gated to reduce the risk of a divergence. It's possible that, for a given client implementation, the uninitialized data is the same across all the nodes. If a program is taking advantage of this (accidental or otherwise) and we change this value, then we could see a divergence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is only uninitialized data from the PoV of the program, the runtime should always have set it to zero so far. Yes, the change needs a feature gate, but we have tested that it would work for all currently used programs on MNB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did plan to feature-gate this change. I thought all SIMDs were feature gates. If I need to call that out explicitly in the proposal, I can.

Comment on lines +40 to +42
When the feature is activated, the VM shall set register 2 (`r2`) to contain a
pointer to the beginning of the instruction data section within the input
region. The instruction data format remains unchanged:
Copy link

@topointon-jump topointon-jump Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long must the pointer retain this value for? The whole execution, or only for a certain period of time/number of instructions? What is considered "at program entrypoint" - just the first instruction? Do you mind clarifying in the SIMD? Thank-you!

Apologies if I missed this anywhere.

Copy link
Contributor

@Lichtso Lichtso Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the first instruction, r2 is the second argument register, thus a parameter of the entrypoint function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add this to the proposal!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants