Add support for sorting of UTF16 keys #109

Open · wants to merge 3 commits into base: master

Conversation

degifted

No description provided.

@Venemo
Owner

Venemo commented Jun 20, 2017

What exactly would be the benefit of this change? Why does it matter how the keys are sorted?

@degifted
Author

It would allow storing strings as keys in lexicographical order. Imagine that you are implementing full-text search and store keywords as keys in a table. Then, if the keys are properly sorted, you can query the most relevant key for an incomplete query string ("apple" for "app", "juice" for "ju", and so on). Currently LMDB only has lexicographical sorting for 8-bit character encodings (i.e., via a strcmp() call).
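The lookup degifted describes can be sketched without LMDB at all. Below, a plain sorted array stands in for the database's key order, and `firstWithPrefix` is a hypothetical helper doing what a cursor range seek (e.g. node-lmdb's `Cursor.goToRange`) would do against real sorted keys:

```javascript
// Sketch of prefix lookup over sorted keys; the array stands in for
// LMDB's key order, and firstWithPrefix models a cursor range seek.
const keys = ['apple', 'apricot', 'banana', 'juice'];

function firstWithPrefix(sortedKeys, prefix) {
  // Linear lower bound for brevity; a B-tree cursor does this in O(log n).
  for (const k of sortedKeys) {
    if (k >= prefix) return k.startsWith(prefix) ? k : null;
  }
  return null;
}

console.log(firstWithPrefix(keys, 'app')); // 'apple'
console.log(firstWithPrefix(keys, 'ju'));  // 'juice'
```

The point of the patch is that this only works when the comparator used to maintain the key order agrees with the string order the application expects.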

@da77a
Contributor

da77a commented Jun 21, 2017

This patch as it stands makes existing lmdb files incompatible with the patched version. And vice versa of course. Which would be "bad". Please don't do this. The remainder discusses some alternatives....

It is an unfortunate historical accident that node likes to write little-endian utf16...

But seeing as it does...

This patch is solving the "problem" that node uses utf16le encoding for strings, and node-lmdb doesn't do any conversion, so the ordering is "interesting" outside of the Latin alphabet.
Conventionally, if you care about order you convert to utf8 or utf16be using, say, iconv - and use the binary key interface. The fact that iconv, iconv-lite etc. exist and are to varying extents non-trivial suggests adding features to support encodings into node-lmdb isn't a great path to start down.

The above approach (always convert the application key type to a byte string that compares correctly) is used in many LMDB projects (not node specific).

That said, it wouldn't be completely unreasonable to add a specific, new key type - bigendianutf16string, and encourage its use in new "I just want to store node strings" apps. However, to avoid having the db so created become unusable "unwrapped" (e.g. it can't be used with the mdb util commandline tools) it should work by byte swapping (utf16le <-> utf16be) on read/write of keys NOT by using a custom compare function.
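The byte-swapping idea is easy to demonstrate with Node's built-in `Buffer.swap16()`. 'я' (U+044F) should sort after 'z' (U+007A), but plain byte comparison of the UTF-16LE bytes gets it backwards; swapping each 16-bit unit to big-endian restores code-unit order under plain memcmp-style comparison:

```javascript
// Byte-comparing UTF-16LE misorders keys; the same bytes swapped to
// big-endian compare correctly in code-unit order.
const a = Buffer.from('я', 'utf16le'); // U+044F -> bytes 4f 04
const b = Buffer.from('z', 'utf16le'); // U+007A -> bytes 7a 00

// LE byte comparison sees 0x4f < 0x7a, so 'я' wrongly sorts before 'z'.
console.log(Buffer.compare(a, b));     // -1

// Swap each 16-bit unit to big-endian; the high byte now compares first.
const aBE = Buffer.from(a).swap16();   // bytes 04 4f
const bBE = Buffer.from(b).swap16();   // bytes 00 7a
console.log(Buffer.compare(aBE, bBE)); // 1: 'я' correctly sorts after 'z'
```

This is the substance of the suggestion: do the swap on key read/write and LMDB's default comparator needs no changes.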

@degifted
Author

degifted commented Jun 21, 2017

This patch as it stands makes existing lmdb files incompatible with the patched version. And vice versa of course. Which would be "bad". Please don't do this.

Well, I do not think that this patch will change the sort order of single-byte character sets, so compatibility with databases created by the previous node-lmdb lib should not be broken.

Conventionally, if you care about order you convert to utf8 or utf16be using, say, iconv - and use the binary key interface.

Sorry, but it will not work. You would need to specify a custom comparison function which is utf8/utf16be/whatever aware.

The fact that iconv, iconv-lite etc. exist and are to varying extents non-trivial suggests adding features to support encodings into node-lmdb isn't a great path to start down.

We are probably talking about different things here. node-lmdb DOES perfectly support any encoding so far. The only issue I am trying to solve is correct searching of a string, which requires a correct string compare function. LMDB contains only single-byte comparison (strcmp() or equivalent), so it cannot sort any multibyte strings. Conversion with iconv to any UTF encoding will not help here.

The above approach (always convert the application key type to a byte string that compares correctly) is used in many LMDB projects (not node specific).

And it will end up that, for example, Russian or Belarusian words will be out of order.

That said, it wouldn't be completely unreasonable to add a specific, new key type - bigendianutf16string, and encourage its use in new "I just want to store node strings" apps. However, to avoid having the db so created become unusable "unwrapped" (e.g. it can't be used with the mdb util commandline tools) it should work by byte swapping (utf16le <-> utf16be) on read/write of keys NOT by using a custom compare function.

Again, this would not break compatibility, since the original (embedded in LMDB) comparison compares single-byte strings just the same as UTF-16 Little Endian. And, actually, since LMDB is a library, not a stand-alone database, I would not say it really matters whether you maintain binary compatibility with something else, like OpenLDAP.

@Venemo
Owner

Venemo commented Jun 21, 2017

It would allow storing strings as keys in lexicographical order. Imagine that you are implementing full-text search and store keywords as keys in a table. Then, if the keys are properly sorted, you can query the most relevant key for an incomplete query string

Not sure how useful this would be with regards to full-text search, but I can imagine you could implement a subset of wildcard search with it, i.e. it would find apple when you search for app*, but it would not be useful for more advanced use cases, e.g. *ple would not find apple (unless of course you also store it backwards, which would still not solve cases like a*le).

How exactly are you planning to implement your full-text search?
I'm happy to hear about different use cases of node-lmdb so it's always nice to hear about such stuff.

@degifted
Author

In my case I am trying to implement drop-down list of suggestions in the search input box, just like on google.com

@degifted
Author

I mean, I already implemented it with the aid of this patch and it works properly :)

@da77a
Contributor

da77a commented Jun 21, 2017

This patch as it stands makes existing lmdb files incompatible with the patched version. And vice versa of course. Which would be "bad". Please don't do this.

Well, I do not think that this patch will change the sort order of single-byte character sets, so compatibility with databases created by the previous node-lmdb lib should not be broken.

This is incorrect. Certainly trying to insert new keys or otherwise update the tree will not end well...

Conventionally, if you care about order you convert to utf8 or utf16be using, say, iconv - and use the binary key interface.

Sorry, but it will not work. You would need to specify a custom comparison function which is utf8/utf16be/whatever aware.

Sorry, it does work (as well as the code you presented does). Byte swapping (transcoding from utf16le to utf16be) does exactly what your code does in the comparison function (your code compares byte pairs by casting to uint16_t and then comparing), but in a way that is actually portable (unlike your assumption that the entire world is little-endian), and it has the other advantages I outlined already.

The fact that iconv, iconv-lite etc. exist and are to varying extents non-trivial suggests adding features to support encodings into node-lmdb isn't a great path to start down.

We are probably talking about different things here. node-lmdb does perfectly support any encoding so far. The only issue I am trying to solve is correct searching of a string, which requires a correct string compare function. LMDB contains only single-byte comparison (strcmp() or equivalent), so it cannot sort any multibyte strings. Conversion with iconv to any UTF encoding will not help here.

See above. I'm not making this stuff up. Your use case is supported without a custom compare if you correctly encode the key. Note that LMDB itself does not deal with "strings" at all - it compares sequences of bytes (definitely not strcmp), which works just fine for lexicographical ordering of code units in a big-endian encoding of any width, by definition. The issue in the node-lmdb binding is that its string interface copies little-endian utf16 into the key. If it byte-swapped (converted to big-endian), east-west relations would be much improved ;-) You can do this outside of node-lmdb and use the node-lmdb binary interface (passing the key as a correctly encoded buffer) to do so.

As somewhat of an aside (but in explanation of why I suggested changing the encoding rather than the comparison, and a reason why I suggested using libraries to do that): ordering utf16 code units doesn't always give absolutely correct ordering of code points. Just one of the reasons why utf16 is a horrible encoding... Neither your code nor a simple conversion to big-endian deals with the problem of code point order (which, admittedly, doesn't matter for most strings). If you wanted to get it absolutely correct you could steal the comparison function presented here: https://ssl.icu-project.org/docs/papers/utf16_code_point_order.html ... OR you could concede that a key-value-store library wrapper is the wrong place for this and use ICU, iconv etc. to convert to either utf8 or utf32, both of which sort perfectly using the built-in LMDB comparison method...
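The code-unit vs. code-point quirk referenced above is easy to reproduce in Node (the sample characters here are my own choice): U+FFFD is a single code unit, while U+1F600 encodes as the surrogate pair D83D DE00, so the two orderings disagree, and UTF-8 bytes side with code-point order:

```javascript
// Code-unit order vs code-point order for UTF-16, per the ICU note above.
const lo = '\uFFFD';     // U+FFFD, one 16-bit code unit
const hi = '\u{1F600}';  // U+1F600, encoded as surrogate pair D83D DE00

// JS strings compare by UTF-16 code unit: 0xFFFD > 0xD83D.
console.log(lo > hi);                               // true
// But by code point, U+FFFD < U+1F600.
console.log(lo.codePointAt(0) < hi.codePointAt(0)); // true
// UTF-8 bytes sort by code point, so a plain byte compare agrees with
// code-point order there (EF BF BD vs F0 9F 98 80).
console.log(Buffer.compare(Buffer.from(lo, 'utf8'),
                           Buffer.from(hi, 'utf8'))); // -1
```

This is why converting keys to UTF-8 sidesteps the surrogate-ordering problem entirely.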

The above approach (always convert the application key type to a byte string that compares correctly) is used in many LMDB projects (not node specific).

And will end up that, for example, Russian or Belarusian words will be out of order.

No, it doesn't. This isn't some theoretical discussion - it really works; why don't you try it?
Because Cyrillic is in block 4 (U+04xx) and the modifiers are in block 3 (U+03xx), you won't hit (in Russian) any of the quirks in code point vs. code unit ordering, so a simple big-endian conversion is all you need.
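The claim is easy to check (the sample words are my own): after swapping the keys to big-endian, plain byte comparison puts Russian words in alphabetical order, because the Cyrillic block's code points follow the alphabet:

```javascript
// Checking that big-endian byte order sorts Russian correctly:
// convert each word to UTF-16BE bytes, then compare with memcmp semantics.
const toBE = s => Buffer.from(s, 'utf16le').swap16();

const words = ['яблоко', 'банан', 'апельсин', 'вишня'];
const sorted = [...words].sort((x, y) => Buffer.compare(toBE(x), toBE(y)));

console.log(sorted); // [ 'апельсин', 'банан', 'вишня', 'яблоко' ]
```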

Again, this would not break the compatibility since original (embedded in LMDB) comparison compares single byte strings just the same as UTF16 Little Endian.

You keep talking about single-byte strings, by which I presume you mean ones less than \u0100. Even in this case your assertion that it will "just work" is incorrect. And in any case, you are not the only user of characters outside of Latin-1.

And, actually, since LMDB is a library, not a stand-alone database, I would not say it really matters if you comply binary compatibility with something else, like OpenLDAP.

I disagree, because "binary compatibility with something else" includes cases like (this is a real-world example):

- Test/utility code written in Python.
- High-performance code written in C++.
- Less compute-intensive code written in node.js.

All 3 access the same LMDB database, in some cases concurrently.

The problem would also arise when simply mixing access to the same database between node.js programs using versions of node-lmdb with/without this change applied.

@Venemo
Owner

Venemo commented Jun 21, 2017

@degifted @da77a Here is my opinion about this pull request:

Unfortunately, it breaks compatibility with existing users of node-lmdb, so I can't accept it as-is.

And then, there is a LOT more to string sorting. The code might cover your specific use case in your language, but it is not a universal solution because it is not aware of the locale / language of the users (or the web site you use it with). Different languages have different rules for sorting characters like o ó ö ő ô etc. So to do proper sorting, one would need to use something like ICU and specify which locale the strings you sort are in.

I believe the best way to solve this is to create a wrapper for mdb_set_compare, thus allowing the user to sort how they like. Then you could implement this same algorithm on top of that, and at the same time allow other users to benefit from it.

What do you guys think about that?

@da77a
Contributor

da77a commented Jun 21, 2017

@degifted @da77a Here is my opinion about this pull request:

Unfortunately, it breaks compatibility with existing users of node-lmdb, so I can't accept it as-is.

I don't need (so don't take it as a request), but would have no issue with, the proposal I made to add a new key type, NodeLmdbKeyType::Utf16BeStringKey. This allows users who just want things ordered as one would expect "by default" based on code units to get on with their lives, and it interops fine with other systems that store big-endian UTF-16. It directly addresses @degifted's use case without his having to change any code that uses his patch, other than to specify that key type.

10 points.

But I haven't got time to write it, so I'd understand if a modified @degifted patch were accepted IF the modification applied the custom sort ONLY if requested "somehow" (e.g. via a different key type as above), and so addressed the obvious issue for the common case of UTF-16 sorting differently in node.js internally to the way it sorts in node-lmdb string indexes...

5 points.

Different languages have different rules for sorting characters like o ó ö ő ô etc. So to do proper sorting, one would need to use something like ICU and specify which locale the strings you sort are in.

Locale-dependent sort is - well - locale dependent. Rebuilding a key-value store from scratch to present a correctly sorted list to a Lithuanian is not likely to be a widely called-for feature - and in particular wasn't what @degifted proposed. What he did propose was sorting in code unit order, which is "near enough to correct" for most uses/users. And we can get that without custom comparison (just store your bytes in the right order, or use UTF-8, where there is only one order...).

But sure, one could support it with high performance (in a really difficult-to-configure and likely-to-surprise way) by including the ICU C code (please make it an opt-in build option, not the default). I'll assume this is unlikely to get done...

1 point.

I believe the best way to solve this is to create a wrapper for mdb_set_compare, thus allowing the user to sort how they like. Then you could implement this same algorithm on top of that, and at the same time allow other users to benefit from it.

But the performance would be bad (having to marshal the keys to be compared to/from JavaScript, plus the performance of the actual JavaScript compare code). And the fact that this feature existed, while no other feature correctly (same as default JavaScript sort) ordered UTF-16 code units, would encourage this "fix" being used way more often than it should be.

-2 points.

What do you guys think about that?

See points above... What I'm trying to say is: add the custom compare support, if at all, only after adding some default "reasonable" behaviour for utf16 strings (i.e. code unit order).

@Venemo
Owner

Venemo commented Jun 21, 2017

@da77a That sounds very reasonable. You convinced me!

@degifted Would it be okay for you to modify the patch so that the change would be triggered by a different, new key type?

@degifted
Author

Thanks guys for such a comprehensive covering of this topic. Let me summarize my thoughts on that.

  1. As I said, this patch obviously will not break binary compatibility of the db if a table contains only lower-code UTF-16 characters (code < 0x100). This can be worked around by switching off the UTF-16 comparison, to defer to the standard LMDB single-byte comparison. In my opinion it is better to control which comparison to use via options passed to Dbi::open, rather than introducing a new key type (and, as a result, a bunch of extra API methods). To be sure that no one gets hurt, let it be disabled by default.
  2. LE/BE handling is not required since V8 stores strings in the native endianness.
  3. I understand that this comparison is far from perfect, but my key point was performance. Any extra computation in that function will have a noticeable impact on the overall performance. Nevertheless, the Unicode charsets are designed to be as in-order as possible, so in most cases we are good. A user should understand that if she enables this simple UTF-16 comparison, she will get mostly-sorted keys suitable for searching the nearest neighbor of an incomplete word, but not much beyond that. To get a truly sorted list of keys one should, for example, generate an index table which addresses the main table in the correct order, with the aid of iconv, ICU or whatever she wants.
  4. A wrapper for mdb_set_compare would kill performance; not a good idea at all.

So, I propose to leave this comparison function as is (well, probably add the fix-up as in the ICU code above, but I doubt it is really necessary in this case) and add an option to switch it on/off.
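For reference, the behaviour of a comparator like the one proposed (comparing keys as sequences of unsigned 16-bit code units) can be modelled in a few lines of JS; the function name is mine, and this is a model of the semantics, not the C++ code from the patch:

```javascript
// A JS model of a UTF-16 code-unit comparator: compare keys unit by
// unit as unsigned 16-bit values, with the shorter key first on ties.
// This matches the built-in '<' ordering of JS strings.
function cmpUtf16(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    const d = a.charCodeAt(i) - b.charCodeAt(i);
    if (d !== 0) return Math.sign(d);
  }
  return Math.sign(a.length - b.length);
}

console.log(cmpUtf16('app', 'apple')); // -1: a prefix sorts first
console.log(cmpUtf16('я', 'z'));       // 1: 0x044F > 0x007A
```

Because this agrees with JS string comparison, keys enumerated by a cursor would come back in the same order a sorted JS array of the same strings would have.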

@da77a
Contributor

da77a commented Jun 22, 2017

@degifted - I like your plan. My main point was (and also your point 3 above) that the general solution is to build indexes by encoding keys through whatever reversible transformation function (be it table lookup or endian conversion or...). This doesn't need support from LMDB and, by extension, node-lmdb. And you have no plans to do anything in that area, which is a good outcome from my perspective.

That leaves the question of default behaviour and ease of use. The existing string key type is intended to provide sane results for naive use of LMDB from node.js. The accidental (I assume) default of byte-string-comparing little-endian UTF-16 produces a collation that isn't terribly useful, but it's what every existing instance of the default/common use of node-lmdb has produced. And even in what might be assumed (even by naive users) to be "8-bit" strings there are often, in UTF-16, stray chars outside the 0..\u0100 range (currency symbols, various forms of quotes, hyphens, other chars depending on which Latin codepage was being mapped from to produce that UTF encoding...), hence the need for a configurable option.

I am not a fan of expanding the key type concept beyond the native LMDB types of fixed-length int vs. byte string, so you don't have to convince me to do the configuring another way.

@Venemo
Owner

Venemo commented Jun 22, 2017

@degifted @da77a Thank you guys for the great discussion. The idea of adding another flag to dbi.open didn't occur to me, but it sounds good, and sounds like the correct™ way to do it!

I imagine something like this:

```js
dbi = env.openDbi({
    name: "mydbi",
    keySort: "utf16string"
});
```

This would be future-proof in the sense that later on we could add other keySort possibilities.


3 participants