Snowball originally generated only C code, and that code leaned quite heavily on the C compiler to optimise away redundancy. We've improved the generated code over time, but the JS code still benefits from a compiler-like pass.
Currently we're using closure-compiler (which effectively compiles JS to JS, optimising along the way) to produce a much smaller JS bundle for use in our website demo. Unfortunately it doesn't accept the code from the proposed ES6 changes in #221.
Sphinx-doc uses uglifyjs to compress Snowball-generated JS code. Comparing the results of the two, uglifyjs is not quite as effective on our generated code, but it's pretty close and we should be able to reduce the gap. It may be hard to eliminate it completely as AIUI uglifyjs doesn't work like an optimising compiler, but perhaps we can get Snowball to generate better code for the cases which currently benefit from a compilation-like pass. It's likely this work would benefit Python too.
One obvious extra thing uglifyjs does is to change Unicode escapes in string literals to UTF-8 encoded source code (so e.g. \u0640, six bytes of source, becomes a two-byte UTF-8 sequence, saving 4 bytes). If I add --charset UTF-8 for closure-compiler that brings its output down to 263504 bytes. If UTF-8 encoded JavaScript source is OK then the Snowball compiler could easily produce it directly - since v3.0.0 it actually does so for target languages which clearly document either that the default source encoding is UTF-8 or a way to specify that it is, but I failed to find that information for JavaScript - e.g. https://tc39.es/ecma262/multipage/ecmascript-language-source-code.html#sec-ecmascript-language-source-code says "The actual encodings used to store and interchange ECMAScript source text is not relevant to this specification".
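The byte counts are easy to confirm with Node (a minimal sketch using the standard `Buffer.byteLength` API; U+0640 is just an arbitrary example character from the two-byte UTF-8 range):

```javascript
// Measure how many bytes each spelling occupies in a UTF-8 source file.
const escaped = "\\u0640"; // the six source characters \ u 0 6 4 0
const raw = "\u0640";      // the actual character U+0640; two bytes in UTF-8

console.log(Buffer.byteLength(escaped, "utf8")); // 6
console.log(Buffer.byteLength(raw, "utf8"));     // 2
```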
Comparing on the snowball-website repo (so including demo.js for both) and allowing UTF-8 JS source, I get 288406 bytes with uglifyjs vs 263504 with closure-compiler (with --charset UTF-8 added), so the uglifyjs output is about 9.5% larger. That seems tolerable, and both are smaller than closure-compiler without UTF-8 output specified. However, the number of stemming languages (and hence the total code size) will continue to grow over time, so I'd be happier either achieving a more similar size reduction with uglifyjs or finding how to make closure-compiler work with the modernised JS output.
I ran js-beautify on the output of both compressors. The functions aren't in the same order so diff doesn't really help, but finding matching functions in each version by hand reveals some differences, and also some possible size savings neither compressor currently gives us:
- uglifyjs doesn't rename BaseStemmer, but that is a very minor difference as it only gets referenced once per subclass.
- uglifyjs doesn't rename methods of BaseStemmer, which likely makes a significant difference. We could just use short method names in base-stemmer.js and the generated code (or see if there's an option).
- Neither compressor changes the names of attributes of BaseStemmer such as cursor, limit_backwards, etc. We could use short names for these to get an additional size saving (or see if there's an option). Ironically the generated C code (where the identifier length doesn't matter) uses shorter names.
- The second element of each "among" entry is very often -1 - if we swapped the second and third entries we could make this element optional and save some code size.
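On the method and attribute renaming points, uglifyjs does have a property-mangling mode which might cover these names without changing the generated code. A hedged sketch of what the options could look like via the Node API (the regex and the name list are purely illustrative, and the behaviour of mangle.properties should be checked against uglify-js's documentation before relying on it):

```javascript
// Illustrative uglify-js options (unverified): also mangle property names,
// limited by a regex so only BaseStemmer internals get renamed and any
// public API keeps its name.
const options = {
  mangle: {
    properties: {
      // Names taken from the discussion above; a real list would need to
      // cover every BaseStemmer method/attribute that is safe to rename.
      regex: /^(cursor|limit|limit_backwards|bra|ket)$/,
    },
  },
};
// Usage sketch:
// const { code, error } = require("uglify-js").minify(source, options);
```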
> The second element of each "among" entry is very often -1 - if we swapped the second and third entries we could make this element optional and save some code size.
The third element is also often -1, but a quick tally with grep -c seems to show 1041 instances of the third element being -1 versus 4676 of the second (ignoring the rare cases which use an among function and have 4 elements). So swapping and making the element optional should save 4674*4 bytes (or 3673*3 after minification/compression, since the only change there seems to be dropping whitespace).
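As a concrete sketch of the packing idea (field names here are my guesses, not Snowball's actual ones): emit the often-(-1) field last, omit it when it is -1, and restore the canonical [s, substring_i, result] order when the table is loaded:

```javascript
// Hypothetical packed form: [s, result] when substring_i is -1,
// [s, result, substring_i] otherwise. Expand back to [s, substring_i, result].
function expandAmong(packed) {
  return packed.map(([s, result, substring_i = -1]) => [s, substring_i, result]);
}

console.log(expandAmong([["ation", 1], ["ement", 2, 0]]));
// → [ [ 'ation', -1, 1 ], [ 'ement', 0, 2 ] ]
```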
This ticket is spun off discussion in sphinx-doc/sphinx#13561