Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(util): attempt of performance improve on MIMEType class #46607

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

metcoder95
Copy link
Member

@metcoder95 metcoder95 commented Feb 10, 2023

Implements performance#38

The PR makes an attempt to improve the performance of the MIMEType class at the moment of parsing an incoming MIME string.
The vast majority of changes were made in an attempt of preserving current implementation meanwhile pushing the limits as far as possible.

As shown in the performance#38 comment, the improvements were about 12 - 20% on average.

Compared against Node v18

                                                                                  confidence improvement accuracy (*)   (**)  (***)
util/mime-parser.js n=100000 strings='application/json; charset="utf-8"'                 ***     12.07 %       ±2.78% ±3.72% ±4.87%
util/mime-parser.js n=100000 strings='text/html ;charset=gbk'                            ***      8.13 %       ±2.18% ±2.91% ±3.79%
util/mime-parser.js n=100000 strings='text/html; charset=gbk'                            ***      4.20 %       ±2.35% ±3.13% ±4.08%
util/mime-parser.js n=100000 strings='text/html;charset= "gbk"'                          ***     10.10 %       ±2.02% ±2.69% ±3.50%
util/mime-parser.js n=100000 strings='text/html;charset=GBK'                             ***     11.53 %       ±2.61% ±3.48% ±4.57%
util/mime-parser.js n=100000 strings='text/html;charset=gbk'                             ***      8.37 %       ±2.23% ±2.96% ±3.86%
util/mime-parser.js n=100000 strings='text/html;charset=gbk;charset=windows-1255'        ***     14.77 %       ±1.66% ±2.21% ±2.88%
util/mime-parser.js n=100000 strings='text/html;x=(;charset=gbk'                         ***     12.14 %       ±3.32% ±4.41% ±5.75%

The improvements were between ~8-12% on average, which is good but still not sure if enough to call it successful. On ops/sec, the results were unprecise as with the previous baseline of:

 * util#MIMEType x 1,349,103 ops/sec ±0.16% (92 runs sampled)
 * undici#parseMIMEType x 2,349,808 ops/sec ±0.14% (99 runs sampled)
 * undici#parseMIMEType(original) x 1,520,777 ops/sec ±0.26% (100 runs sampled)
 * fast-content-type-parse#parse x 5,663,227 ops/sec ±0.58% (98 runs sampled)
 * fast-content-type-parse#safeParse x 5,719,013 ops/sec ±0.44% (95 runs sampled)

the new baseline varied to:

util#MIMEType x 2,075,831 ops/sec ±0.14% (197 runs sampled)
undici#parseMIMEType x 2,729,673 ops/sec ±0.15% (194 runs sampled)
fast-content-type-parse#parse x 5,751,725 ops/sec ±0.13% (195 runs sampled)
fast-content-type-parse#safeParse x 5,766,105 ops/sec ±0.10% (195 runs sampled)
Fastest is fast-content-type-parse#safeParse

Almost ~50%, which seems to be quite out compared to the results of Node benchmarks. I wanted to put them on the table but wouldn't ensure they are precise as they were varying by ~20% on several iterations.

Happy to see any feedback on possible points I missed or further improvements that can be made 🙂

cc: @anonrig

@nodejs-github-bot nodejs-github-bot added the needs-ci PRs that need a full CI run. label Feb 10, 2023
@metcoder95 metcoder95 changed the title chore: attempt of performance improve on MIMEType class chore(util): attempt of performance improve on MIMEType class Feb 15, 2023
@metcoder95 metcoder95 force-pushed the perf/mime_type branch 2 times, most recently from fe82488 to c6e4132 Compare February 20, 2023 16:23
@metcoder95 metcoder95 marked this pull request as ready for review February 20, 2023 16:30

function escapeQuoteOrSolidus(str) {
let result = '';
for (let i = 0; i < str.length; i++) {
const char = str[i];
result += (char === '"' || char === '\\') ? `\\${char}` : char;
const code = StringPrototypeCharCodeAt(char, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Ideally, if we don't have quote or backslash character, we don't need to create a result string and append it for each character. I don't know the probability of having this happy path, but if it's common, we might just check on the beginning and return the str

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had in mind using the Regex#matchAll but it requires having the Regex marked as global (making them stateful), which can make usages of Regex tricky. Matter of tradeoffs maybe?

Though, this only affects while using the toString method of the class (template literals, concatenation, etc.). At parsing is not involved at all 🙂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RexExp.prototype.matchAll creates a RegExpStringIterator object, and it has all kinds of prototype pollution vulnerabilities. I don't think it's the right tool for lib/ (and doubt it would be more performant than the alternatives).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is only performant at the extent of the regex iterating over the string at once and having all the indices where this applies, so the escaping does not need to iterate once again over the string.

But I agree that the security side it is more important here 👍

@metcoder95
Copy link
Member Author

Hi @aduh95, I noticed that RegExpPrototypeTest is marked as dangerous as it makes a lookup to the RegExp.prototype.exec of the chain.

Sadly this comes with regressions in the overall performance (~10%). Is there anything we can do to also speed up the Primordial?
Asking as I noticed that on net#isIPv4 is marked as allowed but subject to improvement once performance gets better

@aduh95
Copy link
Contributor

aduh95 commented Mar 3, 2023

@metcoder95 what are you comparing it with?

@metcoder95
Copy link
Member Author

RegExpPrototypeExec, which is the closest one.

This is the small benchmark I ran:

const regex = /foo*/;
const regexTest = /foo*/;
const regexExec = /foo*/;
const str = 'table football, foosball';

suite
  .add('RegExp#exec', function () {
    regexExec.exec(str)[0];
  })
  .add('RegExp#test', function () {
    regexTest.test(str);
  })
  .add('String#search', function () {
    str.search(regex);
  })
  .add('String#match', function () {
    str.match(regex);
  })
  .on('cycle', function (event) {
    console.log(String(event.target))
  })
  .on('complete', function () {
    console.log('Fastest is ' + this.filter('fastest').map('name'))
  })
  .run({ async: true })

Results:

RegExp#exec x 37,341,962 ops/sec ±0.24% (96 runs sampled)
RegExp#test x 66,946,944 ops/sec ±0.09% (98 runs sampled)
String#search x 53,180,141 ops/sec ±0.06% (93 runs sampled)
String#match x 36,475,617 ops/sec ±0.18% (99 runs sampled)

The benchmark is pretty rudimentary but it highlights a significative difference.

@aduh95
Copy link
Contributor

aduh95 commented Mar 3, 2023

You should also compare the primordials to have a more complete benchmark. You can access them if you use --expose-internals flag and import them like this:

const {
  RegExpPrototypeExec,
  RegExpPrototypeTest,
  StringPrototypeSearch,
  StringPrototypeMatch,
} = require('internal/test/binding').primordials;

You should also test adding === null after the exec calls, in case V8 is able to recognize the pattern.

1 similar comment
@aduh95
Copy link
Contributor

aduh95 commented Mar 3, 2023

You should also compare the primordials to have a more complete benchmark. You can access them if you use --expose-internals flag and import them like this:

const {
  RegExpPrototypeExec,
  RegExpPrototypeTest,
  StringPrototypeSearch,
  StringPrototypeMatch,
} = require('internal/test/binding').primordials;

You should also test adding === null after the exec calls, in case V8 is able to recognize the pattern.

@aduh95
Copy link
Contributor

aduh95 commented Mar 3, 2023

You can also add the case where we use hardenRegExp for completeness sake, although it's unlikely to have better perf results.

@metcoder95
Copy link
Member Author

Sure! I made the changes suggested, and the results didn't vary that much:

RegExp#exec x 37,376,162 ops/sec ±0.25% (92 runs sampled)
RegExp#test x 66,952,599 ops/sec ±0.07% (98 runs sampled)
String#search x 53,146,898 ops/sec ±0.09% (98 runs sampled)
String#match x 35,598,485 ops/sec ±0.26% (98 runs sampled)
const {
    RegExpPrototypeExec,
    RegExpPrototypeTest,
    StringPrototypeSearch,
    StringPrototypeMatch,
  } = require('internal/test/binding').primordials;
const Benchmark = require('benchmark')

const suite = new Benchmark.Suite()

const regex = /foo*/;
const regexTest = /foo*/;
const regexExec = /foo*/;
const str = 'table football, foosball';

suite
  .add('RegExp#exec', function () {
    RegExpPrototypeExec(regexExec, str) === null;
  })
  .add('RegExp#test', function () {
    RegExpPrototypeTest(regexTest, str) === false;
  })
  .add('String#search', function () {
    StringPrototypeSearch(str, regex) === -1;
  })
  .add('String#match', function () {
    StringPrototypeMatch(str, regex) === null;
  })
  .on('cycle', function (event) {
    console.log(String(event.target))
  })
  .on('complete', function () {
    console.log('Fastest is ' + this.filter('fastest').map('name'))
  })
  .run({ async: true })

This is actually nice, it seems that Primordials doesn't add any overhead

@aduh95
Copy link
Contributor

aduh95 commented Mar 3, 2023

So according to your results, RegExpPrototypeTest looks to be twice as fast as RegExpPrototypeExec and SafeStringPrototypeSearch. It'd be nice if V8 exposed a way to reimplement an alternative to %RegExp.prototype.test% that would be as fast but without doing any prototype lookup, but I don't know how to make that. Unfortunately, until then it seems we have to choose between robustness and speed.

@metcoder95
Copy link
Member Author

metcoder95 commented Mar 5, 2023

Understood, I'll fallback to SafeStringPrototypeSearch instead, I'm expecting a regression of ~10%, but will update the numbers afterward.

Thanks for the instructions @aduh95 🙇
cc: @anonrig

@metcoder95
Copy link
Member Author

I made the changes accordingly and it had a reduction of ~5% which stills marks an improvement.

1,952,986 ops/sec

Though, I'm having a little bit of struggles understanding what's wrong with the testing and why are they timing out, any guidance would be super helpful 😅

@anonrig
Copy link
Member

anonrig commented Mar 8, 2023

any guidance would be super helpful 😅

I reached out from OpenJSF Slack

@metcoder95
Copy link
Member Author

Thanks @anonrig, I actually thing found the culprit. It seems a bad parsing of a given MIME is causing the issue, I'll look through the day 👍

@aduh95
Copy link
Contributor

aduh95 commented Mar 10, 2023

It's probably worth trying out SafeRegExp like Deno: denoland/deno#17592
Could you try to add that to your benchmark suite?

@metcoder95
Copy link
Member Author

Thanks, let me give it a try and I'll share the results as well 👍

@metcoder95
Copy link
Member Author

metcoder95 commented Mar 10, 2023

I found this line that hardens regex patterns on primordials.
Is it possible to use it instead?

https://github.com/nodejs/node/blob/359172868fa186bf8647fd58fa091d932481cbd4/lib/internal/per_context/primordials.js#L636C17-L722

@aduh95
Copy link
Contributor

aduh95 commented Mar 10, 2023

Is it possible to use it instead?

Of course you can, but it has terrible performance.

@metcoder95
Copy link
Member Author

Of course you can, but it has terrible performance.

Well, was worth it to ask 🤷‍♂️

@aduh95
Copy link
Contributor

aduh95 commented Mar 10, 2023

FYI here are the results I've found:

RegExpPrototypeExec x 13,128,237 ops/sec ±6.29% (87 runs sampled)
RegExpPrototypeTest x 7,634,710 ops/sec ±1.71% (94 runs sampled)
RegExp#test x 24,493,324 ops/sec ±0.17% (90 runs sampled)
RegExpPrototypeTest+hardenedRegExp x 5,678,249 ops/sec ±0.29% (94 runs sampled)
SafeRegExp#test x 6,288,065 ops/sec ±0.15% (92 runs sampled)
SafeStringPrototypeSearch x 13,386,956 ops/sec ±1.67% (93 runs sampled)
RegExpPrototypeSymbolSearch x 6,099,046 ops/sec ±0.39% (91 runs sampled)
Code
'use strict';
const {
    RegExpPrototypeExec,
    RegExpPrototypeTest,
    RegExpPrototypeSymbolSearch,
    SafeStringPrototypeSearch,
    hardenRegExp,
    makeSafe,
  } = require('internal/test/binding').primordials;
const Benchmark = require('benchmark')

const SafeRegExp = makeSafe(RegExp,
    class SafeRegExp extends RegExp {}
)

const suite = new Benchmark.Suite();

const regex = /foo*/;
const regexTest = /foo*/;
const regexHardenedTest = hardenRegExp(/foo*/);
const safeRegexTest = new SafeRegExp(regexHardenedTest);
const regexExec = /foo*/;
const str = 'table football, foosball';

suite
  .add('RegExpPrototypeExec', function () {
    RegExpPrototypeExec(regexExec, str) === null;
  })
  .add('RegExpPrototypeTest', function () {
    RegExpPrototypeTest(regexTest, str) === false;
  })
  .add('RegExp#test', function () {
    regexTest.test(str) === false;
  })
  .add('RegExpPrototypeTest+hardenedRegExp', function () {
    RegExpPrototypeTest(regexHardenedTest, str) === false;
  })
  .add('SafeRegExp#test', function () {
    safeRegexTest.test(str) === false;
  })
  .add('SafeStringPrototypeSearch', function () {
    SafeStringPrototypeSearch(str, regex) === -1;
  })
  .add('RegExpPrototypeSymbolSearch', function () {
    RegExpPrototypeSymbolSearch(regex, str) === -1;
  })
  .on('cycle', function (event) {
    console.log(String(event.target))
  })
  .on('complete', function () {
    console.log('Fastest is ' + this.filter('fastest').map('name'))
  })
  .run({ async: true });

AFAICT, SafeStringPrototypeSearch is still the best available option if we don't want prototype pollution.

@metcoder95
Copy link
Member Author

AFAICT, SafeStringPrototypeSearch is still the best option if we don't want prototype pollution.

Agree, the changes I made reflected ~5% regression, but they were worth it compared to the progression made. I just have struggled with some scenarios that need adjustments.

Hope can have something by eow 👍

@metcoder95
Copy link
Member Author

Update on this PR

Sadly haven't had the time to properly focus on this, might have some next week but has to be seen 🙁
On the other hand, I made a step back and tried another approach and that was to be more explicit about the standard steps on how to parse the mime string.
Made a few changes, and adjusted. Performance-wise, it is on the same line as the previous benchmark but by trying to trick bit the standard for performance, I facing some smaller blockers with certain cases.

Solved the timeout issue, but the edge cases are the ones to cover now. Feel free to have a look and I'll incorporate feedback if any. Thanks! 🙂

Comment on lines 252 to 253
name = `${name}`;
value = `${value}`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why though?
can value and name be not a string?
maybe creating a string only if it is not a string is faster.

Suggested change
name = `${name}`;
value = `${value}`;
typeof name !== 'string' && (name = `${name}`);
typeof value !== 'string'&& (value = `${value}`);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but I believe that it might be to trigger the toString function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but why would you need that on a string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imagine that is in case the value passed isn't a string so we enforce the primitive representation or whatever toString returns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe creating a string only if it is not a string is faster.

I doubt it, V8 should already have it optimized, I would expect doing the test manually as suggested above would be slower than not doing it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've already added most of the points, and beyond some allocations that makes a substantial difference (~10%), most of the string optimizations showed negligible improvements.

I'll just check the tests passed as before and will push the changes for review

@@ -182,103 +298,103 @@ class MIMEParams {
for (const { 0: key, 1: value } of this.#data) {
const encoded = encode(value);
// Ensure they are separated
if (ret.length) ret += ';';
if (ret.length !== 0) ret += ';';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (ret.length !== 0) ret += ';';
ret.length !== 0 && (ret += ';');

Copy link
Contributor

@Uzlopak Uzlopak Jul 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe even

ret.length !== 0 && (ret += `;${key}=${encoded}`) || (ret += `${key}=${encoded}`);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 on that suggestion, the version with if is much more readable.

continue;
}
const endOfSource = str.length;
const marker = { position };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is marker an object, because we want to keep it as byRef parameter?, needs comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's the main reason. To use as a checkpoint record for the iterations

StringPrototypeSlice(str, marker.position, valueEnd);

// #11-9-2 Remove any trailing HTTP whitespace from parameterValue.
const trimmedValue = StringPrototypeTrimEnd(rawValue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe first check if the last character is a whitespace and then call if so trimEnd.

Comment on lines +362 to +365
marker.position += trimmedValue.length;

// #11-9-3 If parameterValue is the empty string, then continue.
if (trimmedValue.length === 0) continue;
Copy link
Contributor

@Uzlopak Uzlopak Jul 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch positions as if trimmedValue.length === 0 means that marker.position += trimmedValue.length is doing nothing.

Suggested change
marker.position += trimmedValue.length;
// #11-9-3 If parameterValue is the empty string, then continue.
if (trimmedValue.length === 0) continue;
// #11-9-3 If parameterValue is the empty string, then continue.
if (trimmedValue.length === 0) continue;
marker.position += trimmedValue.length;

Comment on lines +384 to +388
SafeStringPrototypeSearch(
parameterValue,
NOT_HTTP_QUOTED_STRING_CODE_POINT,
) === -1 &&
params.has(parameterName) === false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnt .has on Map O(1)?

Suggested change
SafeStringPrototypeSearch(
parameterValue,
NOT_HTTP_QUOTED_STRING_CODE_POINT,
) === -1 &&
params.has(parameterName) === false
params.has(parameterName) === false &&
SafeStringPrototypeSearch(
parameterValue,
NOT_HTTP_QUOTED_STRING_CODE_POINT,
) === -1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but I'm afraid I'm not following 🤔

this.#type = data.type;
this.#subtype = data.subtype;
// #1 Remove any leading and trailing HTTP whitespace from input
string = StringPrototypeTrim(`${string}`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
string = StringPrototypeTrim(`${string}`);
typeof string !== 'string' && (string = `${string}`)
string = StringPrototypeTrim(string);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%String.prototype.trim% always call ToString per spec, it shouldn't be necessary to to convert it first.

Suggested change
string = StringPrototypeTrim(`${string}`);
string = StringPrototypeTrim(string);

const { type, subtype, parametersStringIndex } = parseTypeAndSubtype(string);
this.#type = type;
this.#subtype = subtype;
this.#essence = `${this.#type}/${this.#subtype}`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this.#essence = `${this.#type}/${this.#subtype}`;
this.#essence = `${type}/${subtype}`;

Comment on lines 444 to +445
const invalidTypeIndex = SafeStringPrototypeSearch(v, NOT_HTTP_TOKEN_CODE_POINT);
if (v.length === 0 || invalidTypeIndex !== -1) {
if (invalidTypeIndex !== -1 || v.length === 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this maybe better?

    if (v.length === 0) {
      throw new ERR_INVALID_MIME_SYNTAX('type', v, -1);
    }

    const invalidTypeIndex = SafeStringPrototypeSearch(v, NOT_HTTP_TOKEN_CODE_POINT);
    if (invalidTypeIndex !== -1) {
      throw new ERR_INVALID_MIME_SYNTAX('type', v, invalidTypeIndex);
    }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can even merge the SafeStringPrototypeSearch within the if statement

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to pass the invalidTypeIndex to the error, then you have to do it outside the if

Copy link
Contributor

@Uzlopak Uzlopak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some thoughts. maybe you can press some op/s out of my suggestions. looking for feedback

@Uzlopak
Copy link
Contributor

Uzlopak commented Jul 30, 2023

Btw. If you use benchmarkjs for benching and use async true (for what reason ever) then you could set delay to 0.

@metcoder95
Copy link
Member Author

added some thoughts. maybe you can press some op/s out of my suggestions. looking for feedback

Thanks for the feedback @Uzlopak, they are a fresh look over the PR about several optimisations a missed 👌
Has been a while since I visited the PR, so let me take the baseline performance and implement some optimizations based on your feedback to compare results

@Uzlopak
Copy link
Contributor

Uzlopak commented Jul 30, 2023

This PR got me actually the motivation to clone the node repo and try to get it built and run some benchmarks

@metcoder95
Copy link
Member Author

I'm re-taking this PR. I'm giving it a try with the recommendations from @Uzlopak and doing some experiments with direct buffers to see if we can speed this ahead since we start parsing the MIMEType string.

I'd hope to have something by end of week 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants