Javascript minification/compression #236

Open
ojwb opened this issue May 19, 2025 · 1 comment

ojwb commented May 19, 2025

This ticket is spun off discussion in sphinx-doc/sphinx#13561

Snowball originally only generated C code and the generated code leaned quite heavily on the C compiler to optimise away redundancy. We've improved the generated code over time but the JS code still benefits from a compiler-like pass.

Currently we're using closure-compiler (which effectively compiles JS to JS, optimising on the way) to produce a much smaller JS bundle for use in our website demo. Unfortunately this doesn't like the code from the proposed ES6 changes in #221.

Sphinx-doc uses uglifyjs to compress Snowball-generated JS code. Comparing the results of the two, uglifyjs is not quite as effective on our generated code, but it's pretty close and we should be able to reduce the gap. It may be hard to eliminate it completely as AIUI uglifyjs doesn't work like an optimising compiler, but perhaps we can get Snowball to generate better code for the cases which currently benefit from a compilation-like pass. It's likely this work would benefit Python too.

One obvious extra thing uglifyjs does is to change Unicode escapes in string literals to UTF-8 encoded source code (so e.g. \u0640 becomes a two byte UTF-8 sequence, saving 4 bytes). If I add --charset UTF-8 for closure-compiler that brings its output down to 263504 bytes. If UTF-8 encoded JavaScript source is OK then the Snowball compiler could easily produce it directly: since v3.0.0 it actually does so for target languages which clearly document that the default source encoding is UTF-8, or document a way to specify that it is. I failed to find that info for JavaScript, though - e.g. https://tc39.es/ecma262/multipage/ecmascript-language-source-code.html#sec-ecmascript-language-source-code says "The actual encodings used to store and interchange ECMAScript source text is not relevant to this specification".
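To make the per-character saving concrete, here is a small sketch that measures the source size of a one-character string literal written both ways. It relies only on the standard `TextEncoder` API (which always encodes as UTF-8); the variable names are illustrative and not taken from Snowball or either compressor.

```javascript
// Compare the size of a string literal written with a \uXXXX escape
// against the same literal written as raw UTF-8 source text.
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Source text of the escaped form: quote, backslash, u, 0, 6, 4, 0, quote.
const escapedSource = '"\\u0640"';
// Source text of the raw form: quote, the character U+0640 itself, quote.
const rawSource = '"\u0640"';

const escapedBytes = encoder.encode(escapedSource).length;
const rawBytes = encoder.encode(rawSource).length;

console.log(escapedBytes, rawBytes, escapedBytes - rawBytes); // 8 4 4
```

Characters above U+07FF need three UTF-8 bytes, so the saving for those would be 3 bytes per escape rather than 4.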

Comparing on the snowball-website repo (so including demo.js for both) and allowing UTF-8 JS source I get 288406 bytes with uglifyjs vs 263504 with closure-compiler (with --charset UTF-8 added), so uglifyjs output is about 9.5% larger. That seems tolerable, and both are smaller than closure-compiler without specifying UTF-8 output. However, the number of stemming languages and hence the total code size will continue to grow over time, so I'd be happier to achieve a more similar size reduction with uglifyjs, or to find how to make closure-compiler work with modernised JS output.
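The percentage quoted above can be reproduced from the two byte counts. This is just the arithmetic, with illustrative variable names:

```javascript
// Byte counts quoted in the comparison above.
const uglifyBytes = 288406;
const closureBytes = 263504;

// How much larger the uglifyjs output is, in bytes and as a percentage
// of the closure-compiler output.
const extraBytes = uglifyBytes - closureBytes;
const extraPercent = (extraBytes / closureBytes) * 100;

console.log(extraBytes, extraPercent.toFixed(2) + "%"); // 24902 9.45%
```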

I ran js-beautify on the output of both compressors. The functions aren't in the same order so diff doesn't really help, but finding matching functions in each version by hand reveals some differences, and also some possible size savings neither compressor currently gives us:

  • uglifyjs doesn't rename BaseStemmer, but that is a very minor difference as it only gets referenced once per subclass
  • uglifyjs doesn't rename methods of BaseStemmer which likely makes a significant difference. We could just use short method names in base-stemmer.js and the generated code (or see if there's an option).
  • Neither compressor changes the names of attributes of BaseStemmer such as cursor, limit_backwards, etc. We could use short names for these to get an additional size saving (or see if there's an option). Ironically the generated C code (where the identifier length doesn't matter) uses shorter names.
  • The second element of each "among" entry is very often -1 - if we swapped the second and third entries we could make this element optional and save some code size.
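The last idea in the list above could look something like the following sketch. The entry layout here is an assumption for illustration (a string followed by two integers, with an optional fourth element for entries which use an among function), and `toSwappedLayout` and `readOptional` are hypothetical helpers, not functions from base-stemmer.js or the generated code.

```javascript
// Assumed layouts for one "among" entry (illustrative only):
//   current:  [string, second, third]
//   proposed: [string, third, second], with a trailing -1 omitted

// Convert an entry from the current layout to the proposed one.
// Entries with extra elements (e.g. an among function) keep every element,
// since a value in the middle of the array cannot be omitted.
function toSwappedLayout(entry) {
  const [s, second, third, ...rest] = entry;
  if (rest.length === 0 && second === -1) return [s, third];
  return [s, third, second, ...rest];
}

// Read the now-optional element back, defaulting to -1 when it was omitted.
function readOptional(entry) {
  return entry.length > 2 ? entry[2] : -1;
}

const sample = [["a", -1, 1], ["b", 0, 2], ["c", -1, -1]];
const swapped = sample.map(toSwappedLayout);

console.log(JSON.stringify(swapped)); // [["a",1],["b",2,0],["c",-1]]
console.log(swapped.map(readOptional)); // [ -1, 0, -1 ]
```

The reader side only needs a length check, so the cost in the base class should be small compared to the per-entry saving in the tables.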

ojwb commented May 19, 2025

> The second element of each "among" entry is very often -1 - if we swapped the second and third entries we could make this element optional and save some code size.

The third element is also often -1, but a quick tally with grep -c seems to show 1041 instances of the third being -1 and 4676 of the second (ignoring the rare cases which use an among function and have 4 elements). So swapping and making optional should save 4676*4 bytes (or 4676*3 after minification/compression since the only change here seems to be dropping whitespace).
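As a rough check of that estimate, the following sketch multiplies the tally of 4676 (taken at face value from the grep count above) by the text removed per entry. It assumes each omitted element is written as `, -1` in the generated code and as `,-1` after minification.

```javascript
// Tally of entries whose second element is -1, as quoted above.
const omittableEntries = 4676;

// Text removed per entry, assuming this is how the element is written.
const perEntryGenerated = ", -1".length; // 4 bytes, with whitespace
const perEntryMinified = ",-1".length;   // 3 bytes, whitespace dropped

const savedGenerated = omittableEntries * perEntryGenerated;
const savedMinified = omittableEntries * perEntryMinified;

console.log(savedGenerated, savedMinified); // 18704 14028
```

Against the roughly 263504 to 288406 byte outputs mentioned earlier, about 14 KB would be a saving of around 5%.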
