
Improved Polish stemmer from #159 #220


Open
wants to merge 13 commits into master

Conversation

@mitya57 (Contributor) commented Mar 14, 2025

I took the stemmer which was originally written by @tomek-ai in #159 and rewrote it using substring checks.

The original commit is preserved (but rebased), and my modifications are in separate commits for review purposes (please feel free to squash when merging). My first commit does not cause behavior changes; the subsequent commits do.

There is some room for improvement, e.g. the stemmer does not recognize all possible superlative forms (naj-), but it handles simpler cases quite well.

The corresponding snowball-data PR is snowballstem/snowball-data#30.

@ojwb (Member) commented Mar 14, 2025

Thanks, this is definitely an improvement, especially that it's now mostly working backwards rather than trying to remove suffixes while working forwards.

If you want, I can add explanatory comments on what each ending means, maybe with examples.

That would be very useful. We really need some documentation of the algorithm or else it's much harder to maintain going forwards because we have to guess at the intent; having meanings and examples for each ending would cover a significant part of that.

@mitya57 (Contributor, Author) commented Mar 14, 2025

If linguistically -zacja, etc. are different endings we can just note in comments that 'acja' handles both -acja and -zacja.

Linguistically, -zacja corresponds to borrowed English/Latin -sation (-zation) words, which don't differ much from the remaining -ation words. So maybe it doesn't even deserve a comment.
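(Illustratively: since any word in -zacja also ends in -acja, a single among entry covers both. This is not the actual rule from this PR:

'acja'  // informacja (-acja) as well as organizacja (-zacja)
    (delete)

)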

@mitya57 (Contributor, Author) commented Mar 15, 2025

Updated to add comments and examples (I hope I can use Unicode in comments? The Lithuanian stemmer also has them). And addressed all your other comments.

Unfortunately the Ada tests fail. I have no experience with that language, but I can take a look tomorrow.

Also should I replace tabs with spaces in indentation? The original file used tabs, but I see that most existing stemmers (all except Nepali) use spaces.

@mitya57 (Contributor, Author) commented Mar 16, 2025

Regarding the Ada tests: they fail on the delete command, on the [tomark sizeof 'naj'] delete line.

This patch makes the tests pass, but I'm not sure if it's the proper fix:

--- a/ada/src/stemmer.ads
+++ b/ada/src/stemmer.ads
@@ -181,11 +181,11 @@ private
                       S          : in String;
                       Adjustment : out Integer) with
      Global => null,
-     Pre => C_Bra >= Context.Lb and C_Ket >= C_Bra and C_Ket <= Context.L;
+     Pre => C_Ket >= C_Bra and C_Ket <= Context.L;
 
    procedure Slice_Del (Context : in out Context_Type'Class) with
      Global => null,
-     Pre => Context.Bra >= Context.Lb and Context.Ket >= Context.Bra
+     Pre => Context.Ket >= Context.Bra
      and Context.Ket <= Context.L;
 
    procedure Slice_From (Context : in out Context_Type'Class;

@ojwb (Member) commented Mar 16, 2025

Those Ada preconditions should not include anything to do with Context.L or Context.Lb - the manual says the limits apply to cursor movement, so it should be allowed to modify a slice outside of the limits.

Dropping the invalid preconditions is easy, but the Ada code also seems to use Context.L in place of tracking the current length of the string. That's probably OK for the final result (setlimit C1 for C2 changes the limit for C2, but then it gets reverted, so the limits on exit should be the start and end of the final string) but it means we don't know where the actual end of the string is, so a delete, insert, etc. doesn't know how much of the tail of the string to copy.

Just removing the pre-condition will make the code buggy if there's a forwards limit (i.e. Context.L < the current true length of the string). So I think we need to separately track the string length. I'll sort out a fix.

Anyway I agree this doesn't look like a problem with polish.sbl.

Unicode is fine in comments (sources should be UTF-8 with Unix line endings).

The intention is to only use spaces for indentation except where there's a syntax reason (e.g. GNUmakefile) or a strong convention (e.g. Go standard seems to be tabs: gofmt reindents code with tabs). I've just pushed 7a0e9ed to clean up the deviations from this.

@ojwb (Member) commented Mar 17, 2025

Dropping the invalid preconditions is easy, but the Ada code also seems to use Context.L in place of tracking the current length of the string. That's probably OK for the final result (setlimit C1 for C2 changes the limit for C2, but then it gets reverted, so the limits on exit should be the start and end of the final string) but it means we don't know where the actual end of the string is, so a delete, insert, etc. doesn't know how much of the tail of the string to copy.

Just removing the pre-condition will make the code buggy if there's a forwards limit (i.e. Context.L < the current true length of the string). So I think we need to separately track the string length. I'll sort out a fix.

Anyway I agree this doesn't look like a problem with polish.sbl.

Dear me, quite a lot of wrong stuff in Ada runtime support (and the generator too). I've pushed fixes for the stuff that's relevant here so hopefully the next build will be all green.

@ojwb (Member) commented Mar 20, 2025

Snowball is currently all licensed as 3-clause BSD, and I really don't want to start to allow a proliferation of licences. MIT/X is a broadly equivalent licence, but it still makes it harder for users to understand the licensing situation.

The original PR author says in snowballstem/snowball-website#29 (comment):

The implementation in Snowball was created independently, using the same principles but rewritten from scratch and adapted to Snowball’s structure.

Copyright applies to the implementation of an algorithm, so a rewrite from scratch can use a different choice of licence.

(We should of course acknowledge the origin of the algorithm, but we don't need to use the same licence to do so.)

@mitya57 (Contributor, Author) commented Mar 22, 2025

I pushed a new version that hopefully addresses all your comments and all my own TODO comments.

I was tempted to rewrite the stemmer to avoid repeating adjective-like endings in three places (for adjectives, superlatives and participles), and even started doing that in mitya57/snowball@0e54859, but those changes were too disruptive, so in the end I took the more conservative approach of extending the current algorithm.

@mitya57 (Contributor, Author) commented Mar 22, 2025

And a question. Currently we have logic like:

(
  $(len > 6)
  // remove 4 characters
) or (
  $(len > 5)
  // remove 3 characters
)

But what this code really tries to check is that after removing the endings, the resulting stem is not too short (e.g. there are at least 2 characters left). What is the most elegant way to achieve this?

Other Slavic stemmers try to operate in terms of syllables rather than letters, maybe we should try to do the same here?

@ojwb (Member) commented Mar 23, 2025

But what this code really tries to check is that after removing the endings, the resulting stem is not too short (e.g. there are at least 2 characters left). What is the most elegant way to achieve this?

Most of the other stemmers set up regions in advance and then test those. Usually these are based on syllables, but a length-based version would look like:

$p1 = limit
do (hop 2 setmark p1)

where p1 is an integer variable:

integers (p1)

and (usually) declare a helper routine to test for the region:

define R1 as $p1 <= cursor

Then there's a choice for how you apply it - either:

[substring] R1 among (
  //...
)

or:

setlimit tomark p1 for ([substring]) among (
  //...
)

The difference is that the first finds the longest matching suffix from those in the among and then checks if it is in R1, while the second finds the longest matching suffix in R1 from those in the among.

E.g. the English stemmer's Step_2 contains suffixes fulli and li (note that -y has been changed to -i before this step), so with the current [substring] R1 among we get stems like so:

fully -> fulli

fulli is the longest suffix, but it's not entirely in R1 so we don't remove it.

If that used setlimit tomark p1 for ([substring]) instead we'd get:

fully -> ful

Due to the limit, the string the among actually tests against is lli so suffix li is the longest.

I don't really have any useful insights as to which to use when - I just end up trying both for each situation and seeing which works better.
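Putting those pieces together, a minimal self-contained sketch of the length-based setup (the routine names follow the fragments above; the endings in the among are purely illustrative, not taken from this PR):

integers (p1)
routines (mark_regions R1 remove_endings)
externals (stem)

define mark_regions as (
  // p1 marks the point 2 characters in; R1 is everything after it
  $p1 = limit
  do (hop 2 setmark p1)
)

backwardmode (
  define R1 as $p1 <= cursor

  define remove_endings as (
    [substring] R1 among (
      'om' 'ami'  // illustrative endings only
        (delete)
    )
  )
)

define stem as (
  do mark_regions
  backwards (do remove_endings)
)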

Other Slavic stemmers try to operate in terms of syllables rather than letters, maybe we should try to do the same here?

Most of the stemmers work in terms of syllables. It does seem a better approach, though the difference it makes will depend on the language - if most words alternate vowel and consonant then it'll not make as much difference.

Most languages use the same R1 definition as porter.sbl, or the slight variant in german.sbl which imposes a minimum of 3 characters before R1. There are also various other minor and major variations though: english.sbl has an exception list; czech.sbl (not merged yet) takes syllabic consonants into account; the Kraaij-Pohlmann dutch stemmer treats the ij digraph as a vowel rather than a vowel followed by a consonant; Hungarian has a totally different R1 definition; Indonesian instead just counts the number of vowels; in Serbian, for words that start with two or more vowels, R1 starts after those vowels, but there's also some special handling of words containing r; Yiddish is "Either 3 consonants or the first non-vowel after a vowel".
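For reference, the porter.sbl definition looks like this (quoted from memory, so check the source; v is the grouping of vowels):

groupings ( v )
integers ( p1 p2 )

define v 'aeiouy'

define mark_regions as (
  $p1 = limit
  $p2 = limit
  do (
    gopast v  gopast non-v  setmark p1  // R1: after the first non-vowel following a vowel
    gopast v  gopast non-v  setmark p2  // R2: the same pattern again, one "syllable" further in
  )
)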

One of the first two is probably a good starting point unless there's something that Polish has in common with one of the other languages that's relevant here. The stemmer-compare script allows comparing the results.

I notice though that the difference between the length threshold and the suffix length is not the same for all suffixes - it looks like it can be 1, 2 or 3. That may essentially be a bug, or actually make no difference in practice. The 1 vs 3 could be equivalent to R1 vs R2 (R2 is another syllable after R1).

(Incidentally, len in Snowball is O(n) when working in UTF-8. Here n is the length of a word so not large, but it's a bit silly to repeatedly call len like the code currently does. This was something I thought best to leave for now as it's really just an optimisation; also the Snowball compiler could probably be smarter and reuse the calculated length when it can see the string hasn't changed...)
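(A hand-hoisted version of that reuse, for illustration - wordlen is a hypothetical name, and this is only safe while the string is unchanged between the checks:

integers (wordlen)

// compute the length once per word, before the suffix checks
$wordlen = len
(
  $(wordlen > 6)
  // remove 4 characters
) or (
  $(wordlen > 5)
  // remove 3 characters
)

)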

@mitya57 (Contributor, Author) commented Mar 23, 2025

Updated once again. This time I also tried to solve the problem that the original stemmer removed some d-suffixes but did not support all the possible i-suffixes for them. In particular, I added possible declension forms for diminutives and for -owy adjectives.

Also, I added support for more verbal forms and made the list of verbal forms better structured (e.g. all past forms are grouped together).

The corresponding data changes are in snowballstem/snowball-data#30, each one in a separate commit to make it easier to review.

'ach' // plural locative
'ami' // plural instrumental
'nia' // genitive of words ending with -nie, e.g. czytanie - czytania
'niu' // dative/locative of words ending with -nie, e.g. czytanie - czytaniu
Member (review comment):

czytaniu would actually be handled by 'aniu' on line 62.

'niu' // dative/locative of words ending with -nie, e.g. czytanie - czytaniu
'cia' // genitive of words ending with -cie, e.g. ukrycie - ukrycia
'ciu' // dative/locative of words ending with -cie, e.g. ukrycie - ukryciu
'ce' // dative/locative of words ending with -ka, e.g. muzyka - muzyce
Member (review comment):

I notice muzyka stems to muzyk while muzyce stems to muz - not necessarily a problem, but if it's unexpected or more widespread this might be worth looking at.

'{ak}ca' // nominative singular (feminine)
'{ak}cej' // genitive/dative/locative singular (feminine)
'{ak}c{ak}' // accusative singular (feminine)
'{ak}ce' // nominative/accusative singular (neuter)
Member (review comment):

czytające will actually get -ce removed by remove_nouns and then be trimmed further by remove_general_ends, rather than being handled by this rule. So this may be redundant, but perhaps -ące can be exposed by removal of an ending in a previous step.

'{ak}c' // contemporary adverbial participle (transgressive), e.g. czytając
// different forms of active adjectival participle, e.g. czytający (one who is reading)
'{ak}cy' // nominative singular (masculine)
'{ak}cego' // genitive singular (masculine/neuter)
Member (review comment):

czytającego will actually get -ego removed by remove_adjective_ends then -ąc removed by the rule in this among (line 149).

// different forms of active adjectival participle, e.g. czytający (one who is reading)
'{ak}cy' // nominative singular (masculine)
'{ak}cego' // genitive singular (masculine/neuter)
'{ak}cemu' // dative singular (masculine)
Member (review comment):

Very similar to the -ego case just above.

'{ak}cemu' // dative singular (masculine)
'{ak}cym' // instrumental/locative singular (masculine)
'{ak}ca' // nominative singular (feminine)
'{ak}cej' // genitive/dative/locative singular (feminine)
Member (review comment):

Also similar to the -ego case.

'{ak}cej' // genitive/dative/locative singular (feminine)
'{ak}c{ak}' // accusative singular (feminine)
'{ak}ce' // nominative/accusative singular (neuter)
'{ak}cych' // genitive plural
Member (review comment):

Similar to the -ego case.

'{ak}cymi' // instrumental plural
(delete)
'aj' // imperative, e.g. czytaj
(<- 'a')
Member (review comment):

czytając (example given further above) stems to czytaj, but czytaj gets the -j removed here and then -a removed by remove_general_ends to give stem czyt.

They both seem to be forms of czytać, and that also stems to czyt, which kind of suggests maybe we ought to also be removing the aj from czytając?

Maybe that causes other problems though.

setlimit tomark p1 for ([substring]) among (
'{o'}w' // genitive plural, e.g. włos - włosów
'om' // dative plural, e.g. włos - włosom
'ami' // instrumental plural, e.g. włos - włosami
Member (review comment):

-ami actually gets removed from włosami by remove_nouns (noted as "plural instrumental" there). Maybe this is needed because -ami can be exposed by removing another suffix, while we also need to remove -ami earlier to expose other suffixes - but if so, that kind of suggests the order of rules might not be quite right.

@mitya57 (Contributor, Author) commented Mar 26, 2025

Thank you for the comments! Please don't wait for me if you want to get a new release out; I will fix them eventually.

As I will need to rework it again anyway, I want to ask some questions myself first:

  1. Currently we call all remove_* functions with do syntax, which means they are all called one after another. However this does not always make sense: e.g. if we removed noun endings, it does not make much sense to remove verb or adjective endings, since a word cannot be a noun and a verb/adjective at the same time. The only scenarios where this may make sense are:

    • when a word is an adjective, e.g. czytającego, we may first remove an adjective ending -ego from it, but after that also remove the participle-specific preceding part (-ąc- or -jąc-).
    • in the end, maybe it makes sense to clean up all ending vowels, to make sure that the algorithm consistently produces stems ending with a consonant.

    Maybe I should restructure the stem function to use or instead of do?

  2. As I mentioned above, the original algorithm that I took as a base did not handle all inflectional suffixes well, but at the same time tried to work with derivational suffixes (-an- and -acj- in nouns, diminutive suffixes, -ow- in adjectives). Maybe we should focus on one job (inflectional suffixes) and do it well? This will reduce the number of possible combinations and will make the code cleaner.

  3. In my first question I mentioned removing ending vowels. But do we really want to do that as the unconditional last step? If a vowel is at the end of the word then sure, it should be removed. But for example if we take the verb czytać, then all forms of it contain the a that I highlighted. Maybe we should stem it to czyta in this case?

@ojwb (Member) commented Apr 29, 2025

Sorry, I got rather swallowed up trying to make progress towards a new release.

Thank you for the comments! Please don't wait for me if you want to get a new release out; I will fix them eventually.

Thanks - I am currently leaning towards getting a new release out (it's very nearly 3.5 years since the last one which is too long and there's a huge pile of unreleased useful changes), and then dealing with the pending new algorithms and making another release with those. As well as this there's Czech (#151) mostly needing documentation writing, Persian (#194) waiting for review, and Ukrainian (#144) waiting for the submitter.

As I will need to rework it again anyway, I want to ask some questions myself first:

1. Currently we call all `remove_*` functions with `do` syntax, which means they are all called one after another. However this does not always make sense: e.g. if we removed noun endings, it does not make much sense to remove verb or adjective endings, since a word cannot be a noun and a verb/adjective at the same time. The only scenarios where this may make sense are:
   
   * when a word is an adjective, e.g. _czytającego_, we may first remove an adjective ending _-ego_ from it, but after that also remove the participle-specific preceding part (_-ąc-_ or _-jąc-_).
   * in the end, maybe it makes sense to clean up all ending vowels, to make sure that the algorithm consistently produces stems ending with a consonant.
   
   Maybe I should restructure the `stem` function to use `or` instead of `do`?

If there's a clean structure to the suffixes in the language then reflecting that in the code makes sense, and eliminating checks for suffixes that can never be present seems a good idea - more efficient and eliminates the risk of removing something that looks like a suffix when it isn't.

However you may find there are cases where a suffix is removed but it's just how the word ends, and removing such not-really-suffixes actually helps to conflate forms of a word together, albeit on a stem which isn't linguistically correct (which is OK for our purposes provided it doesn't collide with stems of unrelated words). This is quite common in English (which is really a mess of a language, with a lot of loan words and the like from many other languages), but I've seen it in several others too. I saw a good example of this just the other day in an old mailing list post, but I can't now find it or think of a good example myself.
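To make the restructuring in question 1 concrete, a sketch of an or-based stem (remove_nouns, remove_adjective_ends and remove_general_ends are routine names from this PR; remove_verb_ends and the exact grouping are illustrative):

define stem as (
  backwards (
    // or stops at the first alternative that succeeds, so a word gets
    // at most one part-of-speech treatment
    do (remove_nouns or remove_adjective_ends or remove_verb_ends)
    // unconditional cleanup of trailing vowels
    do remove_general_ends
  )
)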

2. As I mentioned above, the original algorithm that I took as a base did not handle all inflectional suffixes well, but at the same time tried to work with derivational suffixes (_-an-_ and _-acj-_ in nouns, diminutive suffixes, _-ow-_ in adjectives). Maybe we should focus on one job (inflectional suffixes) and do it well? This will reduce the number of possible combinations and will make the code cleaner.

The key goal is to improve retrieval (without harming precision too much), so inflectional suffixes are probably the most critical.

Diminutive suffixes can change meaning enough that conflating or not are both reasonable. The linguistic structure may steer one to a conclusion which isn't fully supported by the meanings of the words. E.g. in English, a young dog is a "puppy" and not conflating seems the obvious choice, but in German a young "Hund" is a "Hündchen" - the structure of the words perhaps suggests conflating. The meanings are probably similarly different in the two languages (I'm not fluent enough in German to say for sure) and the difference is really that in English you'd need a special case, while in German you could conceivably have a -chen rule that also handled "Kätzchen", etc.

-acj- is like English "ation", right? That seems more useful than diminutives at least.

Not sure what -an- and -ow- mean.

3. In my first question I mentioned removing ending vowels. But do we really want to do that as the unconditional last step? If a vowel is at the end of the word then sure, it should be removed. But for example if we take the verb _czyt**a**ć_, then all forms of it contain the _a_ that I highlighted. Maybe we should stem it to _czyta_ in this case?

Returning czyt as the stem is fine for our purposes, provided it doesn't result in conflating unrelated words. Thinking about English words ending in "e": e.g. "pasting" comes from "paste", which has an entirely different meaning to "past". The stem for all of these was "past", but I've added an exception for the next release to address this. The English stemmer goes to some trouble to try to remove or re-add -e but was still getting this case wrong.

@ojwb added this to the 3.1.0 milestone Apr 29, 2025
@mitya57 (Contributor, Author) commented Apr 29, 2025

Thank you for the detailed response!
I also had little time in the last few weeks, but I hope I will find some time in May to make a new version of this PR.
