Skip to content

Remove diacritics when comparing case insensitive strings. #2084

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: stable
Choose a base branch
from

Conversation

carlesalbasboix
Copy link

Bank statements usually have inconsistent diacritics. I propose removing them when performing a case-insensitive search.

I've only added the diacritics that I'm interested in but I can add more.

@jralls
Copy link
Member

jralls commented May 9, 2025

What's the use exact use case and how do you propose to protect users who don't want to ignore diacritics? Have you examined every single use of the string predicate to ensure that this change affects your use case and no other?

@carlesalbasboix
Copy link
Author

carlesalbasboix commented May 10, 2025

In Catalan and Spanish (and I guess in other languages too) diacritics are often ignored in bank statements. For example, sometimes they might write "pasteleria" and others "pastelería".

I set it so that diacritics only get ignored when matching case insensitive. Maybe we could rename the checkbox to "Match case and diacritics"? I don't think that matching diacritics but ignoring case is common. Otherwise I can add another separate checkbox.

@jralls
Copy link
Member

jralls commented May 10, 2025

In Catalan and Spanish (and I guess in other languages too) diacritics are often ignored in bank statements. For example, sometimes they might write "pasteleria" and others "pastelería".

You seem to have ignored the diacritics yourself, there's no diacritic in either string (never mind that I couldn't find an example spelling of pasteleria or patisseria that contains diacritics anyway though the French write pâtisserie).

Anyway, I get the idea. The problem is the "I guess in other languages too" because there may be some languages where the same spelling with and without diacritics, or with different diacritics, are completely different words and this change would create false matches.

Another problem is that case-insensitive query is used in two places: One is the Find dialog that I suspect is your concern, but the other is in the register's Invoice/Bill autocompletion feature where it's unconditional.

I think you could fix both instances by testing the input string for diacritics and modifying the output string only if there are none. That way if the user types a diacritic they get the diacritic match, but if they don't then it's ambiguous and the query can sensibly return matches with and without the diacritic.

Obviously all diacritics in all languages need to be included, otherwise we'll get complaints about inconsistency.

A technical note: Diacritics can be represented two ways so you need to call g_utf8_normalize(utf8_string, -1, G_NORMALIZE_NFC) on both strings.

Please review the coding standard, in particular that we place open braces ({) on a separate line unless the closing brace can go on the same line.

@carlesalbasboix
Copy link
Author

There is indeed a diacritic on the second string, on the "í". That's the proper spelling in Spanish.

What about including another checkbox to ignore diacritics? I think it's the most transparent. I can also add a more exhaustive list of diacritics if I know that the PR will be accepted.

@jralls
Copy link
Member

jralls commented May 10, 2025

Another checkbox in the Find dialog is fine, but you still need a way to distinguish in the Invoice/Bill autocompletion where there is no dialog box. I suppose you could add an item in Preferences>Business, though @gjanssens might object.

@jralls
Copy link
Member

jralls commented May 10, 2025

BTW, you do know that this will work in only those two places, right? There are a lot of other string matching functions that don't involve QofQuery.

@christopherlam
Copy link
Contributor

Shouldn't this be a locale-specific change? And should use ICU?

@jralls
Copy link
Member

jralls commented May 11, 2025

Inserting ICU into the middle of a g_utf8 workflow doesn't make a lot of sense.

As for locale sensitivity, hmm. That implies a bunch of locale-specific hash tables. There would be a performance benefit as it would reduce number of diacritic pairs to the ones used in the current locale but it would increase the complexity a bit.

@carlesalbasboix
Copy link
Author

Technically std::unordered_map::find() has average time complexity O(1) so the number of diacritics shouldn't impact performance.

@jralls
Copy link
Member

jralls commented May 11, 2025

Technically std::unordered_map::find() has average time complexity O(1) so the number of diacritics shouldn't impact performance.

Hash tables like std::unordered_map have two levels of storage and so two levels of complexity. The outer level of storage is the hash table itself. The find function is traversing a BTree so complexity is O(log2 N) where N is the number of buckets. The inner level is a linked list of all items that have the same hash and it's complexity is O(n) where n is the number of items in the bucket. Traversal of both the BTree and the list is accomplished by pointer chasing, but since in this case the whole hash table is created at once the storage locations should be pretty close together and so won't have the cache-miss problem inherent to such containers loaded one item at a time. But size still does matter and a larger table takes longer to search than a smaller one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants