
Regular expression export/import #9976


Open
wants to merge 4 commits into
base: master

sverker
Contributor

@sverker sverker commented Jun 18, 2025

Problem

Before OTP 28.0 it was possible to abuse the compiled format of regular expressions, as returned by re:compile, as if it were a serialized format that could be imported into other Erlang node instances. This abuse happened to work as long as the underlying hardware architecture and PCRE version were not too incompatible. But it was unsafe: passing an incompatible compiled regular expression to re:run could result in arbitrary misbehavior.

In OTP 28.0 the compiled format has changed to not expose the internals of PCRE but instead return a safe (magic) reference to the internal regex structures. A compiled regex is now safe but can only be used in the node instance that compiled it.

Solution

This PR introduces a supported, safe way to export compiled regular expressions. The exported format is self-contained and can be stored off-node or sent to other nodes. If the importing node is compatible (same architecture and PCRE version), the compiled regex can be used directly with minimal overhead. If it is not compatible, the regular expression is recompiled from the original string and options, which are included in the exported format as a fallback.

Usage

% Use 'export' option to re:compile
{ok, Exported} = re:compile(RegexString, [export | OtherOptions]),

Then, on a potentially different node:

Imported = re:import(Exported),

re:run(Subject, Imported),
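
As a rough illustration of the "stored off-node" case, the sketch below serializes an exported pattern with term_to_binary/1 and imports it again. It assumes an OTP release that contains this PR (the export option and re:import/1), and it assumes the exported term is ordinary serializable data, as the format description suggests. The module name re_export_demo and the example regex are hypothetical.

```erlang
-module(re_export_demo).
-export([roundtrip/0]).

%% Sketch only: requires an OTP release that includes the 'export'
%% option to re:compile/2 and re:import/1 from this PR.
roundtrip() ->
    %% Compile to the exported (self-contained) format.
    {ok, Exported} = re:compile(<<"^[a-z]+[0-9]*$">>, [export, caseless]),
    %% Serialize it, e.g. to write to disk or send to another node.
    Bin = term_to_binary(Exported),
    %% Later, possibly on another node: deserialize and import.
    Imported = re:import(binary_to_term(Bin)),
    %% Use the imported regex like any other compiled regex.
    re:run(<<"Abc123">>, Imported, [{capture, none}]).
```

Importing once and reusing the result keeps any recompilation fallback off the hot path.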

Exported format

The exported format is opaque, but currently looks like this:

{re_exported_pattern, HeaderBin, OrigBin, OrigOpts, EncodedBin}

  • HeaderBin - binary with some meta information, including a CRC checksum over EncodedBin
  • OrigBin - original regular expression as a binary string
  • OrigOpts - options passed to re:compile/2
  • EncodedBin - binary containing the compiled regex as encoded by pcre2_serialize_encode()

Future optimization

Users who previously generated Erlang code with compiled regular expressions as literals would now instead compile with the export option and generate re:import(Literal) rather than just the literal. If done this way, the beam loader could be optimized to detect such calls to re:import with literal arguments, evaluate them at load time, and replace them with the returned compiled regular expression as a literal term.
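
A code generator following this pattern might look roughly like the sketch below. It is only an illustration under stated assumptions: the module name re_codegen_demo, the function gen_source/3, and the generated pattern/0 function are hypothetical, and the export option requires an OTP release that contains this PR.

```erlang
-module(re_codegen_demo).
-export([gen_source/3]).

%% Sketch only: produces Erlang source text for a module whose
%% pattern/0 function calls re:import/1 on a literal exported pattern.
%% ModName is an atom; Regex and Opts are as for re:compile/2.
gen_source(ModName, Regex, Opts) ->
    {ok, Exported} = re:compile(Regex, [export | Opts]),
    %% ~p prints the exported tuple in Erlang term syntax, so it ends
    %% up as a literal argument to re:import/1 in the generated code.
    io_lib:format(
      "-module(~p).~n"
      "-export([pattern/0]).~n~n"
      "pattern() ->~n"
      "    re:import(~p).~n",
      [ModName, Exported]).
```

The generated call has a literal argument, which is the shape the proposed loader optimization would look for.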

sverker added 4 commits June 3, 2025 18:56
Split off build_compile_error() from build_compile_result().
"make opt debug" will build one target at a time
but each target's sub-makefile may build in parallel.

This is to avoid corrupted files when the same file is generated
from two Makefile invocations.
@sverker sverker requested a review from rickard-green June 18, 2025 18:33
@sverker sverker self-assigned this Jun 18, 2025
@sverker sverker added team:VM Assigned to OTP team VM enhancement labels Jun 18, 2025
@sverker
Contributor Author

sverker commented Jun 18, 2025

@josevalim What do you think about this?

Contributor

github-actions bot commented Jun 18, 2025

CT Test Results

    4 files    228 suites   1h 53m 38s ⏱️
3 728 tests 3 624 ✅ 104 💤 0 ❌
4 857 runs  4 728 ✅ 129 💤 0 ❌

Results for commit 0597625.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run tests locally.


// Erlang/OTP Github Action Bot

@josevalim
Contributor

I believe this is fantastic and simplifies many of the issues we had to tackle in Elixir. Thank you.

It would be fantastic if this could be used from Erlang too. Perhaps a pass in the compiler will rewrite re:compile into re:import?

Also, do you see this making it into 28.1 or would it be 29 only?

@sverker sverker added this to the OTP-28.1 milestone Jun 19, 2025
@sverker
Contributor Author

sverker commented Jun 19, 2025

The plan is to get this export/import functionality into 28.1. The loader optimization could then follow later, possibly as early as 28.2.

@josevalim
Contributor

@sverker making it part of 28.1 would help Elixir codebases migrate to the latest OTP, so thank you.

I have one additional question: do you think it is reasonable for re:run to automatically import an exported regex? I am thinking about the multi-node scenario, where you would need to explicitly import regexes contained in messages sent across nodes (and the messages could be arbitrarily nested), so having it just work would be beneficial. Or are you worried about importing being expensive if we have to do it on every operation?

@josevalim
Contributor

I have one additional thought: what if the export is part of the existing tagged tuple? For example, you can add a new field to {re_pattern, _, _, _, _} that returns the export or the atom none. If exported, then you can transparently send it across nodes or run it locally with no performance cost. The receiving node can also run it transparently but it has the option of importing it to make sure it is optimised. What do you think would be the pros and cons of this approach?
