Add support for sorting of UTF16 keys #109
Conversation
What exactly would be the benefit of this change? Why does it matter how the keys are sorted?
It would allow strings to be stored as keys in lexicographical order. Imagine that you are implementing full-text search and store keywords as keys in a table. Then, if the keys are properly sorted, you can query the key closest to the search string ("apple" for "app", "juice" for "ju", and so on). Currently LMDB only sorts lexicographically for 8-bit character encodings (i.e. via a strcmp()-style comparison).
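For illustration, here is a minimal sketch of the kind of prefix lookup this enables once keys sort in true lexicographic order. It assumes node-lmdb's cursor API (`goToRange`/`goToNext`) as I recall it from the README; the path, database name and options are placeholders for the example.

```js
const lmdb = require('node-lmdb');

// Placeholder environment/database setup for the example.
const env = new lmdb.Env();
env.open({ path: './data', maxDbs: 1 });
const dbi = env.openDbi({ name: 'keywords', create: true });

// Return up to `limit` stored keys that start with `prefix`,
// e.g. suggest('app') -> ['apple', 'application', ...]
function suggest(prefix, limit = 10) {
  const txn = env.beginTxn({ readOnly: true });
  const cursor = new lmdb.Cursor(txn, dbi);
  const results = [];
  // goToRange positions the cursor at the first key >= prefix
  for (let key = cursor.goToRange(prefix);
       key !== null && key.startsWith(prefix) && results.length < limit;
       key = cursor.goToNext()) {
    results.push(key);
  }
  cursor.close();
  txn.abort();
  return results;
}
```

This only gives useful results for non-ASCII keywords if the keys actually compare in code-unit/code-point order, which is what the rest of this thread is about.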
This patch as it stands makes existing LMDB files incompatible with the patched version, and vice versa of course, which would be "bad". Please don't do this. The rest of this comment discusses some alternatives.

It is an unfortunate historical accident that Node likes to write little-endian UTF-16, but seeing as it does: this patch is solving the "problem" that Node uses utf16le encoding for strings and node-lmdb doesn't do any conversion, so ordering is "interesting" outside the Latin alphabet. The standard approach (always convert the application key type to a byte string that compares correctly under the default byte-wise comparison) is used in many LMDB projects, not just Node-specific ones.

That said, it wouldn't be completely unreasonable to add a specific, new key type - a big-endian UTF-16 string - and encourage its use in new "I just want to store Node strings" apps. However, to avoid databases created that way becoming unusable "unwrapped" (e.g. with the mdb_* command-line utilities), it should work by byte swapping (utf16le <-> utf16be) on read/write of keys, NOT by using a custom compare function.
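A minimal sketch of that byte-swapping approach, assuming Node's `Buffer` API (`Buffer.prototype.swap16` flips each 16-bit unit in place, turning UTF-16LE into UTF-16BE). The `keyIsBuffer`/`putBinary` names at the end are how I recall node-lmdb's binary-key interface; check the current docs before relying on them.

```js
// Encode a JS string as a big-endian UTF-16 buffer so that LMDB's
// default byte-wise comparison yields code-unit order - no custom
// compare function, and the mdb_* tools can still read the file.
function toUtf16BeKey(str) {
  const buf = Buffer.from(str, 'utf16le'); // Node strings are UTF-16LE internally
  return buf.swap16();                     // in-place LE -> BE byte swap
}

function fromUtf16BeKey(buf) {
  return Buffer.from(buf).swap16().toString('utf16le'); // copy, swap back, decode
}

// Usage (assumed binary-key interface; option/method names may differ):
//   const dbi = env.openDbi({ name: 'keywords', create: true, keyIsBuffer: true });
//   txn.putBinary(dbi, toUtf16BeKey('яблоко'), Buffer.from('some value'));
```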
Well, I do not think this patch will change the sort order for single-byte character sets, so compatibility with databases created by the previous node-lmdb library should not be broken.
Sorry, but it will not work. You would need to specify a custom comparison function that is utf8/utf16be/whatever aware.
We are probably talking about different things here. node-lmdb DOES support any encoding perfectly well so far. The only issue I am trying to solve is correct searching for a string, which requires a correct string compare function. LMDB only ships a single-byte comparison (strcmp() or equivalent), so it cannot sort multibyte strings. Converting with iconv to any UTF encoding will not help here.
And it will end up with, for example, Russian or Belarusian words being out of order.
Again, this would not break compatibility, since the original (embedded in LMDB) comparison sorts single-byte strings just the same as UTF-16 Little Endian. And actually, since LMDB is a library, not a stand-alone database, I would not say it really matters whether you keep binary compatibility with something else, like OpenLDAP.
Not sure how useful this would be with regard to full-text search, but I can imagine you could implement a subset of wildcard search with it (prefix matching, essentially). How exactly are you planning to implement your full-text search?
In my case I am trying to implement a drop-down list of suggestions in the search input box, just like on google.com
I mean I have already implemented it with the aid of this patch, and it works properly :)
This is incorrect. Certainly trying to insert new keys or otherwise update the tree will not end well...
Sorry, it does work (just as well as the code you presented). Byte swapping (transcoding from utf16le to utf16be) achieves exactly what your comparison function does (your code compares byte pairs by casting to uint16_t and comparing), but in a way that is actually portable (unlike your assumption that the entire world is little-endian) and has the other advantages I already outlined.
See above. I'm not making this stuff up. Your use case is supported without a custom compare if you encode the key correctly. Note that LMDB itself does not deal with "strings" at all - it compares sequences of bytes (definitely not strcmp), which works just fine for lexicographical ordering of code units in a big-endian encoding of any width, by definition. The issue in the node-lmdb binding is that its string interface copies little-endian UTF-16 into the key. If it byte-swapped (converted to big endian), east-west relations would be much improved ;-) You can do this outside of node-lmdb and use the node-lmdb binary interface (passing the key as a correctly encoded buffer).

As somewhat of an aside (but in explanation of why I suggested changes in encoding rather than changes in comparison, and why I suggested using libraries to do that): ordering UTF-16 code units doesn't always give absolutely correct ordering of code points - just one of the reasons why UTF-16 is a horrible encoding. Neither your code nor a simple conversion to big endian deals with the problem of code point order (which, admittedly, doesn't matter for most strings). If you wanted to get it absolutely correct you could steal the comparison function presented here: https://ssl.icu-project.org/docs/papers/utf16_code_point_order.html ... OR you could concede that a key-value-store library wrapper is the wrong place for this and use ICU, iconv, etc. to convert to either UTF-8 or UTF-32, both of which sort perfectly using the built-in LMDB comparison.
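For reference, here is a sketch of the code-unit fix-up that the linked ICU paper describes, transliterated to JavaScript purely for illustration; it assumes well-formed UTF-16 and is not part of this patch or of node-lmdb.

```js
// Compare two JS strings in Unicode code point order by shifting
// surrogate code units above the rest of the BMP before comparing.
// Assumes well-formed UTF-16 (unpaired surrogates are not given
// ICU's full special-case treatment).
function compareCodePointOrder(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    let c1 = a.charCodeAt(i);
    let c2 = b.charCodeAt(i);
    if (c1 === c2) continue;
    // Fix-up is only needed when both code units are >= 0xD800:
    // move [0xE000, 0xFFFF] down and surrogates up, so surrogate
    // pairs (code points > 0xFFFF) sort after all BMP characters.
    if (c1 >= 0xd800 && c2 >= 0xd800) {
      c1 = c1 >= 0xe000 ? c1 - 0x800 : c1 + 0x2000;
      c2 = c2 >= 0xe000 ? c2 - 0x800 : c2 + 0x2000;
    }
    return c1 - c2;
  }
  return a.length - b.length;
}
```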
No it doesn't. This isn't some theoretical discussion. It really works - why don't you try it.
You keep talking about single-byte strings, by which I presume you mean strings whose characters are all below \u0100. Even in this case your assertion that it will "just work" is incorrect. And in any case, you are not the only user of characters outside of Latin-1.
I disagree, because "binary compatibility with something else" includes cases like this (a real-world example):

- test/utility code written in Python,
- high-performance code written in C++,
- less compute-intensive code written in node.js,

with all three accessing the same LMDB database, in some cases concurrently. The problem would also arise when simply mixing access to the same database between node.js programs using versions of node-lmdb with and without this change applied.
@degifted @da77a Here is my opinion about this pull request: unfortunately, it breaks compatibility with existing users of node-lmdb, so I can't accept it as-is. Beyond that, there is a LOT more to string sorting. The code might cover your specific use case in your language, but it is not a universal solution, because it is not aware of the locale / language of the users (or of the web site you use it with). Different languages have different rules for sorting the same characters. I believe the best way to solve this is to create a wrapper around LMDB's custom-compare support so that users can plug in their own comparison function. What do you guys think about that?
I don't need (so don't take this as a request), but would have no issue with, the proposal I made to add a new key type, NodeLmdbKeyType::Utf16BeStringKey. This lets users who just want things ordered as one would expect "by default", based on code units, get on with their lives, and it interoperates fine with other systems that store big-endian UTF-16. It directly addresses @degifted's use case without his having to change any code that uses his patch other than specifying that key type. 10 points. But I haven't got time to write it, so I'd understand if a modified @degifted patch were accepted IF the modification applied the custom sort ONLY when requested "somehow" (e.g. via a different key type as above), and so addressed the obvious issue for the common case: UTF-16 sorting differently inside node.js than it does in node-lmdb string indexes. 5 points.
Locale-dependent sort is - well - locale dependent. Rebuilding a key-value store from scratch to present a correctly sorted list to a Lithuanian is not likely to be a widely requested feature - and in particular it wasn't what @degifted proposed. What he did propose was sorting in code unit order, which is "near enough to correct" for most uses/users. And we can get that without a custom comparison (just store your bytes in the right order, or use UTF-8, where there is only one order...). But sure, one could support locale-aware sorting with high performance (in a really difficult-to-configure and likely-to-surprise way) by including ICU C code (please make it an opt-in build option, not the default). I'll assume this is unlikely to get done... 1 point.
But the performance would be bad (keys would have to be marshalled to/from JavaScript for every comparison, plus the cost of the JavaScript compare code itself). And the fact that this feature existed, while no other feature correctly ordered UTF-16 code units (the same way the default JavaScript sort does), would encourage this "fix" being used far more often than it should be. -2 points.
See the points above... What I'm trying to say is: add custom compare support, if at all, only after adding some default "reasonable" behaviour for UTF-16 strings (i.e. code unit order).
Thanks guys for covering this topic so comprehensively. Let me summarize my thoughts on it.
So, I propose to leave this comparison function as it is (well, perhaps adding the fix-up as in the IBM (ICU) code above, though I doubt it is really necessary in this case) and to add an option to switch it on/off.
@degifted - I like your plan. My main point (and also your point 3 above) is that the general solution is to build indexes by encoding keys through whatever reversible transformation function is appropriate (be it table lookup, endian conversion, or something else). This doesn't need support from LMDB, nor by extension from node-lmdb, and you have no plans to do anything in that area, which is a good outcome from my perspective.

That leaves the question of default behaviour and ease of use. The existing string key type is intended to provide sane results for naive use of LMDB from node.js. The accidental (I assume) default - byte-wise comparison of little-endian UTF-16 - produces a collation that isn't terribly useful, but it's what every existing instance of the default/common use of node-lmdb has produced. And even in what might be assumed (even by naive users) to be "8-bit" strings there are often, in UTF-16, stray characters outside the 0..\u0100 range (currency symbols, various forms of quotes, hyphens, and other characters depending on which Latin code page was mapped to produce that UTF encoding...), hence the need for a configurable option.
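As one concrete instance of such a reversible key transformation (a sketch, not something node-lmdb does for you): encode keys as UTF-8 buffers, since UTF-8 byte order matches code point order under LMDB's default byte-wise comparison.

```js
// Sketch: UTF-8 keys are a reversible encoding that sorts in code point
// order with LMDB's built-in comparison. Assumes the binding's
// binary-key interface (method/option names may differ).
const encodeKey = (str) => Buffer.from(str, 'utf8');
const decodeKey = (buf) => buf.toString('utf8');
```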
@degifted @da77a Thank you guys for the great discussion. The idea to add another flag sounds good to me. I imagine something like this:
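(A hypothetical sketch only - the `keyIsUtf16` flag name below is illustrative, not an actual node-lmdb option.)

```js
// Hypothetical: opt in to UTF-16 code-unit ordering per database.
// Existing databases keep the default byte-wise comparison, so
// compatibility with files created by older node-lmdb is preserved.
const dbi = env.openDbi({
  name: 'keywords',
  create: true,
  keyIsUtf16: true   // illustrative flag name, not the real API
});
```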
This would be future-proof in the sense that later on we could add other flags in the same way.