-
Notifications
You must be signed in to change notification settings - Fork 840
Remove diacritics when comparing case insensitive strings. #2084
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: stable
Are you sure you want to change the base?
Conversation
What's the use exact use case and how do you propose to protect users who don't want to ignore diacritics? Have you examined every single use of the string predicate to ensure that this change affects your use case and no other? |
In Catalan and Spanish (and I guess in other languages too) diacritics are often ignored in bank statements. For example, sometimes they might write "pasteleria" and others "pastelería". I set it so that diacritics only get ignored when matching case insensitive. Maybe we could rename the checkbox to "Match case and diacritics"? I don't think that matching diacritics but ignoring case is common. Otherwise I can add another separate checkbox. |
You seem to have ignored the diacritics yourself, there's no diacritic in either string (never mind that I couldn't find an example spelling of pasteleria or patisseria that contains diacritics anyway though the French write pâtisserie). Anyway, I get the idea. The problem is the "I guess in other languages too" because there may be some languages where the same spelling with and without diacritics, or with different diacritics, are completely different words and this change would create false matches. Another problem is that case-insensitive query is used in two places: One is the Find dialog that I suspect is your concern, but the other is in the register's Invoice/Bill autocompletion feature where it's unconditional. I think you could fix both instances by testing the input string for diacritics and modifying the output string only if there are none. That way if the user types a diacritic they get the diacritic match, but if they don't then it's ambiguous and the query can sensibly return matches with and without the diacritic. Obviously all diacritics in all languages need to be included, otherwise we'll get complaints about inconsistency. A technical note: Diacritics can be represented two ways so you need to call Please review the coding standard, in particular that we place open braces ( |
There is indeed a diacritic on the second string, on the "í". That's the proper spelling in Spanish. What about including another checkbox to ignore diacritics? I think it's the most transparent. I can also add a more exhaustive list of diacritics if I know that the PR will be accepted. |
Another checkbox in the Find dialog is fine, but you still need a way to distinguish in the Invoice/Bill autocompletion where there is no dialog box. I suppose you could add an item in Preferences>Business, though @gjanssens might object. |
BTW, you do know that this will work in only those two places, right? There are a lot of other string matching functions that don't involve QofQuery. |
Shouldn't this be a locale-specific change? And should use ICU? |
Inserting ICU into the middle of a g_utf8 workflow doesn't make a lot of sense. As for locale sensitivity, hmm. That implies a bunch of locale-specific hash tables. There would be a performance benefit as it would reduce number of diacritic pairs to the ones used in the current locale but it would increase the complexity a bit. |
Technically |
Hash tables like std::unordered_map have two levels of storage and so two levels of complexity. The outer level of storage is the hash table itself. The find function is traversing a BTree so complexity is O(log2 N) where N is the number of buckets. The inner level is a linked list of all items that have the same hash and it's complexity is O(n) where n is the number of items in the bucket. Traversal of both the BTree and the list is accomplished by pointer chasing, but since in this case the whole hash table is created at once the storage locations should be pretty close together and so won't have the cache-miss problem inherent to such containers loaded one item at a time. But size still does matter and a larger table takes longer to search than a smaller one. |
Bank statements usually have inconsistent diacritics. I propose removing them when performing a case-insensitive search.
I've only added the diacritics that I'm interested in but I can add more.