Remove diacritics when comparing case insensitive strings. #2084

carlesalbasboix · 2025-05-09T20:38:41Z

Bank statements usually have inconsistent diacritics. I propose removing them when performing a case-insensitive search.

I've only added the diacritics that I'm interested in but I can add more.

jralls · 2025-05-09T23:41:59Z

What's the use exact use case and how do you propose to protect users who don't want to ignore diacritics? Have you examined every single use of the string predicate to ensure that this change affects your use case and no other?

carlesalbasboix · 2025-05-10T07:27:01Z

In Catalan and Spanish (and I guess in other languages too) diacritics are often ignored in bank statements. For example, sometimes they might write "pasteleria" and others "pastelería".

I set it so that diacritics only get ignored when matching case insensitive. Maybe we could rename the checkbox to "Match case and diacritics"? I don't think that matching diacritics but ignoring case is common. Otherwise I can add another separate checkbox.

jralls · 2025-05-10T17:59:07Z

In Catalan and Spanish (and I guess in other languages too) diacritics are often ignored in bank statements. For example, sometimes they might write "pasteleria" and others "pastelería".

You seem to have ignored the diacritics yourself, there's no diacritic in either string (never mind that I couldn't find an example spelling of pasteleria or patisseria that contains diacritics anyway though the French write pâtisserie).

Anyway, I get the idea. The problem is the "I guess in other languages too" because there may be some languages where the same spelling with and without diacritics, or with different diacritics, are completely different words and this change would create false matches.

Another problem is that case-insensitive query is used in two places: One is the Find dialog that I suspect is your concern, but the other is in the register's Invoice/Bill autocompletion feature where it's unconditional.

I think you could fix both instances by testing the input string for diacritics and modifying the output string only if there are none. That way if the user types a diacritic they get the diacritic match, but if they don't then it's ambiguous and the query can sensibly return matches with and without the diacritic.

Obviously all diacritics in all languages need to be included, otherwise we'll get complaints about inconsistency.

A technical note: Diacritics can be represented two ways so you need to call g_utf8_normalize(utf8_string, -1, G_NORMALIZE_NFC) on both strings.

Please review the coding standard, in particular that we place open braces ({) on a separate line unless the closing brace can go on the same line.

carlesalbasboix · 2025-05-10T19:53:44Z

There is indeed a diacritic on the second string, on the "í". That's the proper spelling in Spanish.

What about including another checkbox to ignore diacritics? I think it's the most transparent. I can also add a more exhaustive list of diacritics if I know that the PR will be accepted.

jralls · 2025-05-10T20:08:18Z

Another checkbox in the Find dialog is fine, but you still need a way to distinguish in the Invoice/Bill autocompletion where there is no dialog box. I suppose you could add an item in Preferences>Business, though @gjanssens might object.

jralls · 2025-05-10T20:10:26Z

BTW, you do know that this will work in only those two places, right? There are a lot of other string matching functions that don't involve QofQuery.

christopherlam · 2025-05-11T02:19:31Z

Shouldn't this be a locale-specific change? And should use ICU?

jralls · 2025-05-11T03:47:51Z

Inserting ICU into the middle of a g_utf8 workflow doesn't make a lot of sense.

As for locale sensitivity, hmm. That implies a bunch of locale-specific hash tables. There would be a performance benefit as it would reduce number of diacritic pairs to the ones used in the current locale but it would increase the complexity a bit.

carlesalbasboix · 2025-05-11T06:50:02Z

Technically std::unordered_map::find() has average time complexity O(1) so the number of diacritics shouldn't impact performance.

jralls · 2025-05-11T15:58:42Z

Technically std::unordered_map::find() has average time complexity O(1) so the number of diacritics shouldn't impact performance.

Hash tables like std::unordered_map have two levels of storage and so two levels of complexity. The outer level of storage is the hash table itself. The find function is traversing a BTree so complexity is O(log2 N) where N is the number of buckets. The inner level is a linked list of all items that have the same hash and it's complexity is O(n) where n is the number of items in the bucket. Traversal of both the BTree and the list is accomplished by pointer chasing, but since in this case the whole hash table is created at once the storage locations should be pretty close together and so won't have the cache-miss problem inherent to such containers loaded one item at a time. But size still does matter and a larger table takes longer to search than a smaller one.

Remove diacritics when comparing case insensitive strings.

31894b2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove diacritics when comparing case insensitive strings. #2084

Remove diacritics when comparing case insensitive strings. #2084

carlesalbasboix commented May 9, 2025

jralls commented May 9, 2025

carlesalbasboix commented May 10, 2025 •

edited

Loading

jralls commented May 10, 2025

carlesalbasboix commented May 10, 2025

jralls commented May 10, 2025

jralls commented May 10, 2025

christopherlam commented May 11, 2025

jralls commented May 11, 2025

carlesalbasboix commented May 11, 2025

jralls commented May 11, 2025

Remove diacritics when comparing case insensitive strings. #2084

Are you sure you want to change the base?

Remove diacritics when comparing case insensitive strings. #2084

Conversation

carlesalbasboix commented May 9, 2025

jralls commented May 9, 2025

carlesalbasboix commented May 10, 2025 • edited Loading

jralls commented May 10, 2025

carlesalbasboix commented May 10, 2025

jralls commented May 10, 2025

jralls commented May 10, 2025

christopherlam commented May 11, 2025

jralls commented May 11, 2025

carlesalbasboix commented May 11, 2025

jralls commented May 11, 2025

carlesalbasboix commented May 10, 2025 •

edited

Loading