
Improved Polish stemmer from #159 #220


Open
wants to merge 13 commits into master

Conversation

@mitya57 (Contributor) commented Mar 14, 2025

I took the stemmer which was originally written by @tomek-ai in #159 and rewrote it using substring checks.

The original commit is preserved (but rebased), and my modifications are in separate commits for review purposes (please feel free to squash when merging). My first commit does not cause behavior changes; the subsequent commits do.

There is some room for improvement, e.g. the stemmer does not recognize all possible superlative forms (naj-), but it handles simpler cases quite well.

The corresponding snowball-data PR is snowballstem/snowball-data#30.

@ojwb (Member) commented Mar 14, 2025

Thanks, this is definitely an improvement, especially that it's now mostly working backwards rather than trying to remove suffixes while working forwards.

If you want, I can add explanatory comments on what each ending means, maybe with examples.

That would be very useful. We really need some documentation of the algorithm or else it's much harder to maintain going forwards because we have to guess at the intent; having meanings and examples for each ending would cover a significant part of that.

@mitya57 (Contributor, Author) commented Mar 14, 2025

If linguistically -zacja, etc. are different endings we can just note in comments that 'acja' handles both -acja and -zacja.

Linguistically, -zacja corresponds to borrowed English/Latin -sation (-zation) words, which don't differ much from the remaining -ation words. So maybe it doesn't even deserve a comment.
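(Illustratively: since any word in -zacja also ends in -acja, a single among entry covers both. This is not the actual rule from this PR:

'acja'  // informacja (-acja) as well as organizacja (-zacja)
    (delete)

)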

@mitya57 (Contributor, Author) commented Mar 15, 2025

Updated to add comments and examples (I hope I can use Unicode in comments? The Lithuanian stemmer also has them). And addressed all your other comments.

Unfortunately the Ada tests fail. I have no experience with that language, but I can take a look tomorrow.

Also should I replace tabs with spaces in indentation? The original file used tabs, but I see that most existing stemmers (all except Nepali) use spaces.

@mitya57 (Contributor, Author) commented Mar 16, 2025

Regarding the Ada tests: they fail on the delete command, on the [tomark sizeof 'naj'] delete line.

This patch makes the tests pass, but I'm not sure if it's the proper fix:

--- a/ada/src/stemmer.ads
+++ b/ada/src/stemmer.ads
@@ -181,11 +181,11 @@ private
                       S          : in String;
                       Adjustment : out Integer) with
      Global => null,
-     Pre => C_Bra >= Context.Lb and C_Ket >= C_Bra and C_Ket <= Context.L;
+     Pre => C_Ket >= C_Bra and C_Ket <= Context.L;
 
    procedure Slice_Del (Context : in out Context_Type'Class) with
      Global => null,
-     Pre => Context.Bra >= Context.Lb and Context.Ket >= Context.Bra
+     Pre => Context.Ket >= Context.Bra
      and Context.Ket <= Context.L;
 
    procedure Slice_From (Context : in out Context_Type'Class;

@ojwb (Member) commented Mar 16, 2025

Those Ada preconditions should not include anything to do with Context.L or Context.Lb - the manual says the limits apply to cursor movement, so it should be allowed to modify a slice outside of the limits.

Dropping the invalid preconditions is easy, but the Ada code also seems to use Context.L in place of tracking the current length of the string. That's probably OK for the final result (setlimit C1 for C2 changes the limit for C2, but then it gets reverted, so the limits on exit should be the start and end of the final string) but it means we don't know where the actual end of the string is, so a delete, insert, etc. doesn't know how much of the tail of the string to copy.

Just removing the pre-condition will make the code buggy if there's a forwards limit (i.e. Context.L < the current true length of the string). So I think we need to separately track the string length. I'll sort out a fix.

Anyway I agree this doesn't look like a problem with polish.sbl.

Unicode is fine in comments (sources should be UTF-8 with Unix line endings).

The intention is to only use spaces for indentation except where there's a syntax reason (e.g. GNUmakefile) or a strong convention (e.g. Go standard seems to be tabs: gofmt reindents code with tabs). I've just pushed 7a0e9ed to clean up the deviations from this.

@ojwb (Member) commented Mar 17, 2025

Dropping the invalid preconditions is easy, but the Ada code also seems to use Context.L in place of tracking the current length of the string. That's probably OK for the final result (setlimit C1 for C2 changes the limit for C2, but then it gets reverted, so the limits on exit should be the start and end of the final string) but it means we don't know where the actual end of the string is, so a delete, insert, etc. doesn't know how much of the tail of the string to copy.

Just removing the pre-condition will make the code buggy if there's a forwards limit (i.e. Context.L < the current true length of the string). So I think we need to separately track the string length. I'll sort out a fix.

Anyway I agree this doesn't look like a problem with polish.sbl.

Dear me, quite a lot of wrong stuff in Ada runtime support (and the generator too). I've pushed fixes for the stuff that's relevant here so hopefully the next build will be all green.

@ojwb (Member) commented Mar 20, 2025

Snowball is currently all licensed as 3-clause BSD, and I really don't want to start to allow a proliferation of licences. MIT/X is a broadly equivalent licence, but it still makes it harder for users to understand the licensing situation.

The original PR author says in snowballstem/snowball-website#29 (comment):

The implementation in Snowball was created independently, using the same principles but rewritten from scratch and adapted to Snowball’s structure.

Copyright applies to the implementation of an algorithm, so a rewrite from scratch can use a different choice of licence.

(We should of course acknowledge the origin of the algorithm, but we don't need to use the same licence to do so.)

@mitya57 (Contributor, Author) commented Mar 22, 2025

I pushed a new version that hopefully addresses all your comments and all my own TODO comments.

I was tempted to rewrite the stemmer to avoid repeating adjective-like endings in three places (for adjectives, superlatives and participles), and even started doing that in mitya57/snowball@0e54859, but those changes were too disruptive, so in the end I took the more conservative approach of extending the current algorithm.

@mitya57 (Contributor, Author) commented Mar 22, 2025

And a question. Currently we have logic like:

(
  $(len > 6)
  // remove 4 characters
) or (
  $(len > 5)
  // remove 3 characters
)

But what this code really tries to check is that after removing the endings, the resulting stem is not too short (e.g. there are at least 2 characters left). What is the most elegant way to achieve this?

Other Slavic stemmers try to operate in terms of syllables rather than letters, maybe we should try to do the same here?

@ojwb (Member) commented Mar 23, 2025

But what this code really tries to check is that after removing the endings, the resulting stem is not too short (e.g. there are at least 2 characters left). What is the most elegant way to achieve this?

Most of the other stemmers set up regions in advance and then test those. Usually these are based on syllables, but a length-based version would look like:

$p1 = limit
do (hop 2 setmark p1)

where p1 is an integer variable:

integers (p1)

and (usually) declare a helper routine to test for the region:

define R1 as $p1 <= cursor

Then there's a choice for how you apply it - either:

[substring] R1 among (
  //...
)

or:

setlimit tomark p1 for ([substring]) among (
  //...
)

The difference is that the first finds the longest matching suffix from those in the among and then checks if it is in R1, while the second finds the longest matching suffix in R1 from those in the among.

E.g. the English stemmer's Step_2 contains suffixes fulli and li (note that -y has been changed to -i before this step), so with the current [substring] R1 among we get stems like so:

fully -> fulli

fulli is the longest suffix, but it's not entirely in R1 so we don't remove it.

If that used setlimit tomark p1 for ([substring]) instead we'd get:

fully -> ful

Due to the limit, the string the among actually tests against is lli so suffix li is the longest.

I don't really have any useful insights as to which to use when - I just end up trying both for each situation and seeing which works better.
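Putting those pieces together, a minimal self-contained sketch of the length-based setup (the routine names follow the fragments above; the endings in the among are purely illustrative, not taken from this PR):

integers (p1)
routines (mark_regions R1 remove_endings)
externals (stem)

define mark_regions as (
  // p1 marks the point 2 characters in; R1 is everything after it
  $p1 = limit
  do (hop 2 setmark p1)
)

backwardmode (
  define R1 as $p1 <= cursor

  define remove_endings as (
    [substring] R1 among (
      'om' 'ami'  // illustrative endings only
        (delete)
    )
  )
)

define stem as (
  do mark_regions
  backwards (do remove_endings)
)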

Other Slavic stemmers try to operate in terms of syllables rather than letters, maybe we should try to do the same here?

Most of the stemmers work in terms of syllables. It does seem a better approach, though the difference it makes will depend on the language - if most words alternate vowel and consonant then it'll not make as much difference.

Most languages use the same R1 definition as porter.sbl, or the slight variant in german.sbl which imposes a minimum of 3 characters before R1. There are also various other minor and major variations though: english.sbl has an exception list; czech.sbl (not merged yet) takes syllabic consonants into account; the Kraaij-Pohlmann dutch stemmer treats the ij digraph as a vowel rather than a vowel followed by a consonant; Hungarian has a totally different R1 definition; Indonesian instead just counts the number of vowels; in Serbian, for words that start with two or more vowels, R1 starts after those vowels, but there's also some special handling of words containing r; Yiddish is "Either 3 consonants or the first non-vowel after a vowel".
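For reference, the porter.sbl definition looks like this (quoted from memory, so check the source; v is the grouping of vowels):

groupings ( v )
integers ( p1 p2 )

define v 'aeiouy'

define mark_regions as (
  $p1 = limit
  $p2 = limit
  do (
    gopast v  gopast non-v  setmark p1  // R1: after the first non-vowel following a vowel
    gopast v  gopast non-v  setmark p2  // R2: the same pattern again, one "syllable" further in
  )
)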

One of the first two is probably a good starting point unless there's something that Polish has in common with one of the other languages that's relevant here. The stemmer-compare script allows comparing the results.

I notice though that the difference between the length threshold and the suffix length is not the same for all suffixes - it looks like it can be 1, 2 or 3. That may essentially be a bug, or actually make no difference in practice. The 1 vs 3 could be equivalent to R1 vs R2 (R2 is another syllable after R1).

(Incidentally, len in Snowball is O(n) when working in UTF-8. Here n is the length of a word so not large, but it's a bit silly to repeatedly call len like the code currently does. This was something I thought best to leave for now as it's really just an optimisation; also the Snowball compiler could probably be smarter and reuse the calculated length when it can see the string hasn't changed...)
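(A hand-hoisted version of that reuse, for illustration - wordlen is a hypothetical name, and this is only safe while the string is unchanged between the checks:

integers (wordlen)

// compute the length once per word, before the suffix checks
$wordlen = len
(
  $(wordlen > 6)
  // remove 4 characters
) or (
  $(wordlen > 5)
  // remove 3 characters
)

)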

@mitya57 (Contributor, Author) commented Mar 23, 2025

Updated once again. This time I also tried to solve the problem that the original stemmer removed some d-suffixes but did not support all the possible i-suffixes for them. In particular, I added possible declension forms for diminutives and for -owy adjectives.

Also, I added support for more verbal forms and made the list of verbal forms better structured (e.g. all past forms are grouped together).

The corresponding data changes are in snowballstem/snowball-data#30, each one in a separate commit to make it easier to review.

'ach' // plural locative
'ami' // plural instrumental
'nia' // genitive of words ending with -nie, e.g. czytanie - czytania
'niu' // dative/locative of words ending with -nie, e.g. czytanie - czytaniu
Member (review comment):

czytaniu would actually be handled by 'aniu' on line 62.

'niu' // dative/locative of words ending with -nie, e.g. czytanie - czytaniu
'cia' // genitive of words ending with -cie, e.g. ukrycie - ukrycia
'ciu' // dative/locative of words ending with -cie, e.g. ukrycie - ukryciu
'ce' // dative/locative of words ending with -ka, e.g. muzyka - muzyce
Member (review comment):

I notice muzyka stems to muzyk while muzyce stems to muz - not necessarily a problem, but if it's unexpected or more widespread this might be worth looking at.

'{ak}ca' // nominative singular (feminine)
'{ak}cej' // genitive/dative/locative singular (feminine)
'{ak}c{ak}' // accusative singular (feminine)
'{ak}ce' // nominative/accusative singular (neuter)
Member (review comment):

czytające will actually get -ce removed by remove_nouns and then be trimmed further by remove_general_ends, rather than being handled by this rule. So this may be redundant, but perhaps -ące can be exposed by removal of an ending in a previous step.

'{ak}c' // contemporary adverbial participle (transgressive), e.g. czytając
// different forms of active adjectival participle, e.g. czytający (one who is reading)
'{ak}cy' // nominative singular (masculine)
'{ak}cego' // genitive singular (masculine/neuter)
Member (review comment):

czytającego will actually get -ego removed by remove_adjective_ends then -ąc removed by the rule in this among (line 149).

// different forms of active adjectival participle, e.g. czytający (one who is reading)
'{ak}cy' // nominative singular (masculine)
'{ak}cego' // genitive singular (masculine/neuter)
'{ak}cemu' // dative singular (masculine)
Member (review comment):

Very similar to the -ego case just above.

'{ak}cemu' // dative singular (masculine)
'{ak}cym' // instrumental/locative singular (masculine)
'{ak}ca' // nominative singular (feminine)
'{ak}cej' // genitive/dative/locative singular (feminine)
Member (review comment):

Also similar to the -ego case.

'{ak}cej' // genitive/dative/locative singular (feminine)
'{ak}c{ak}' // accusative singular (feminine)
'{ak}ce' // nominative/accusative singular (neuter)
'{ak}cych' // genitive plural
Member (review comment):

Similar to the -ego case.

'{ak}cymi' // instrumental plural
(delete)
'aj' // imperative, e.g. czytaj
(<- 'a')
Member (review comment):

czytając (example given further above) stems to czytaj, but czytaj gets the -j removed here and then -a removed by remove_general_ends to give stem czyt.

They both seem to be forms of czytać, and that also stems to czyt, which kind of suggests maybe we ought to also be removing the aj from czytając?

Maybe that causes other problems though.

setlimit tomark p1 for ([substring]) among (
'{o'}w' // genitive plural, e.g. włos - włosów
'om' // dative plural, e.g. włos - włosom
'ami' // instrumental plural, e.g. włos - włosami
Member (review comment):

-ami actually gets removed from włosami by remove_nouns (noted as "plural instrumental" there). Maybe this is needed because -ami can be exposed by removing another suffix, while we also need to remove -ami earlier to expose other suffixes - but if so, that kind of suggests the order of rules might not be quite right.

@mitya57 (Contributor, Author) commented Mar 26, 2025

Thank you for the comments! Please don't wait for me if you want to get a new release out; I will fix them eventually.

As I will need to rework it again anyway, I want to ask some questions myself first:

  1. Currently we call all remove_* functions with do syntax, which means they are all called one after another. However this does not always make sense: e.g. if we removed noun endings, it does not make much sense to remove verb or adjective endings, since a word cannot be a noun and a verb/adjective at the same time. The only scenarios where this may make sense are:

    • when a word is an adjective, e.g. czytającego, we may first remove an adjective ending -ego from it, but after that also remove the participle-specific preceding part (-ąc- or -jąc-).
    • in the end, maybe it makes sense to clean up all ending vowels, to make sure that the algorithm consistently produces stems ending with a consonant.

    Maybe I should restructure the stem function to use or instead of do?

  2. As I mentioned above, the original algorithm that I took as a base did not handle all inflectional suffixes well, but at the same time tried to work with derivational suffixes (-an- and -acj- in nouns, diminutive suffixes, -ow- in adjectives). Maybe we should focus on one job (inflectional suffixes) and do it well? This will reduce the number of possible combinations and will make the code cleaner.

  3. In my first question I mentioned removing ending vowels. But do we really want to do that as the unconditional last step? If a vowel is at the end of the word then sure, it should be removed. But for example if we take the verb czytać, then all forms of it contain the a that I highlighted. Maybe we should stem it to czyta in this case?

@ojwb (Member) commented Apr 29, 2025

Sorry, I got rather swallowed up trying to make progress towards a new release.

Thank you for the comments! Please don't wait for me if you want to get a new release out; I will fix them eventually.

Thanks - I am currently leaning towards getting a new release out (it's very nearly 3.5 years since the last one which is too long and there's a huge pile of unreleased useful changes), and then dealing with the pending new algorithms and making another release with those. As well as this there's Czech (#151) mostly needing documentation writing, Persian (#194) waiting for review, and Ukrainian (#144) waiting for the submitter.

As I will need to rework it again anyway, I want to ask some questions myself first:

1. Currently we call all `remove_*` functions with `do` syntax, which means they are all called one after another. However this does not always make sense: e.g. if we removed noun endings, it does not make much sense to remove verb or adjective endings, since a word cannot be a noun and a verb/adjective at the same time. The only scenarios where this may make sense are:
   
   * when a word is an adjective, e.g. _czytającego_, we may first remove an adjective ending _-ego_ from it, but after that also remove the participle-specific preceding part (_-ąc-_ or _-jąc-_).
   * in the end, maybe it makes sense to clean up all ending vowels, to make sure that the algorithm consistently produces stems ending with a consonant.
   
   Maybe I should restructure the `stem` function to use `or` instead of `do`?

If there's a clean structure to the suffixes in the language then reflecting that in the code makes sense, and eliminating checks for suffixes that can never be present seems a good idea - more efficient and eliminates the risk of removing something that looks like a suffix when it isn't.

However you may find there are cases where a suffix is removed but it's just how the word ends, and removing such not-really-suffixes actually helps to conflate forms of a word together, albeit on a stem which isn't linguistically correct (which is OK for our purposes provided it doesn't collide with stems of unrelated words). This is quite common in English (which is really a mess of a language, with a lot of loan words and the like from many other languages), but I've seen it in several others too. I saw a good example of this just the other day in an old mailing list post, but I can't now find it or think of a good example myself.
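To make the restructuring in question 1 concrete, a sketch of an or-based stem (remove_nouns, remove_adjective_ends and remove_general_ends are routine names from this PR; remove_verb_ends and the exact grouping are illustrative):

define stem as (
  backwards (
    // or stops at the first alternative that succeeds, so a word gets
    // at most one part-of-speech treatment
    do (remove_nouns or remove_adjective_ends or remove_verb_ends)
    // unconditional cleanup of trailing vowels
    do remove_general_ends
  )
)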

2. As I mentioned above, the original algorithm that I took as a base did not handle all inflectional suffixes well, but at the same time tried to work with derivational suffixes (_-an-_ and _-acj-_ in nouns, diminutive suffixes, _-ow-_ in adjectives). Maybe we should focus on one job (inflectional suffixes) and do it well? This will reduce the number of possible combinations and will make the code cleaner.

The key goal is to improve retrieval (without harming precision too much), so inflectional suffixes are probably the most critical.

Diminutive suffixes can change meaning enough that conflating or not are both reasonable. The linguistic structure may steer one to a conclusion which isn't fully supported by the meanings of the words. E.g. in English, a young dog is a "puppy" and not conflating seems the obvious choice, but in German a young "Hund" is a "Hündchen" - the structure of the words perhaps suggests conflating. The meanings are probably similarly different in the two languages (I'm not fluent enough in German to say for sure) and the difference is really that in English you'd need a special case, while in German you could conceivably have a -chen rule that also handled "Kätzchen", etc.

-acj- is like English "ation", right? That seems more useful than diminutives at least.

Not sure what -an- and -ow- mean.

3. In my first question I mentioned removing ending vowels. But do we really want to do that as the unconditional last step? If a vowel is at the end of the word then sure, it should be removed. But for example if we take the verb _czyt**a**ć_, then all forms of it contain the _a_ that I highlighted. Maybe we should stem it to _czyta_ in this case?

Returning czyt as the stem is fine for our purposes, provided it doesn't result in conflating unrelated words. Thinking about English words ending in "e": e.g. "pasting" comes from "paste", which has an entirely different meaning to "past". The stem for all of these was "past", but I've added an exception for the next release to address this. The English stemmer goes to some trouble to try to remove or re-add -e but was still getting this case wrong.

@ojwb added this to the 3.1.0 milestone Apr 29, 2025
@mitya57 (Contributor, Author) commented Apr 29, 2025

Thank you for the detailed response!
I also had little time in the last few weeks, but I hope I will find some time in May to make a new version of this PR.
