-
Notifications
You must be signed in to change notification settings - Fork 181
Improved Polish stemmer from #159 #220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Thanks, this is definitely an improvement. Especially that it's now mostly working
That would be very useful. We really need some documentation of the algorithm or else it's much harder to maintain going forwards because we have to guess at the intent; having meanings and examples for each ending would cover a significant part of that. |
4f19ad7
to
6ef006b
Compare
Linguistically |
6ef006b
to
9e37dcb
Compare
Updated to add comments and examples (I hope I can use unicode in comments? The Lithuanian stemmer also has them). And addressed all your other comments. Unfortunately the Ada tests fail. I have no experience with that language, but I can take a look tomorrow. Also should I replace tabs with spaces in indentation? The original file used tabs, but I see that most existing stemmers (all except Nepali) use spaces. |
9e37dcb
to
42626da
Compare
Regarding Ada tests: it fails on This patch makes the tests pass, but I'm not sure if it's the proper fix: --- a/ada/src/stemmer.ads
+++ b/ada/src/stemmer.ads
@@ -181,11 +181,11 @@ private
S : in String;
Adjustment : out Integer) with
Global => null,
- Pre => C_Bra >= Context.Lb and C_Ket >= C_Bra and C_Ket <= Context.L;
+ Pre => C_Ket >= C_Bra and C_Ket <= Context.L;
procedure Slice_Del (Context : in out Context_Type'Class) with
Global => null,
- Pre => Context.Bra >= Context.Lb and Context.Ket >= Context.Bra
+ Pre => Context.Ket >= Context.Bra
and Context.Ket <= Context.L;
procedure Slice_From (Context : in out Context_Type'Class; |
Those Ada preconditions should not include anything to do with Dropping the invalid preconditions is easy, but the Ada code also seems to use Unicode is fine in comments (sources should be UTF-8 with Unix line endings). The intention is to only use spaces for indentation except where there's a syntax reason (e.g. |
42626da
to
ac4acc1
Compare
Dear me, quite a lot of wrong stuff in Ada runtime support (and the generator too). I've pushed fixes for the stuff that's relevant here so hopefully the next build will be all green. |
ac4acc1
to
8056aa8
Compare
8056aa8
to
315e7c4
Compare
Snowball is currently all licensed as 3-clause BSD, and I really don't want to start to allow a proliferation of licences. MIT/X is a broadly equivalent licence, but it still makes it harder for users to understand the licensing situation. The original PR author says in snowballstem/snowball-website#29 (comment) :
Copyright applies to the implementation of an algorithm, so a rewrite from scratch can use a different choice of licence. (We should of course acknowledge the origin of the algorithm, but we don't need to use the same licence to do so.) |
In case they were intended for -ę words like zwierzę, they had more false positives than true ones. And that declension is quite rare.
315e7c4
to
073537e
Compare
I pushed a new version that hopefully addresses all your comments and all my own TODO comments. I had a temptation to rewrite the stemmer to avoid repeating adjective-like endings in three places (for adjectives, superlatives and participles) and even started doing that in mitya57/snowball@0e54859, but those changes were too disruptive, so in the end I took a more conservative approach which was extending the current algorithm. |
And a question. Currently we have logic like:
But what this code really tries to check is that after removing the endings, the resulting stem is not too short (e.g. there are at least 2 characters left). What is the most elegant way to achieve this? Other Slavic stemmers try to operate in terms of syllables rather than letters, maybe we should try to do the same here? |
Most of the other stemmers set up regions in advance and then test those. Usually these are based on syllables, but a length based version would look like:
where
and (usually) declare a helper routine to test for the region:
Then there's a choice for how you apply it - either:
or:
The difference is that the first finds the longest matching suffix from those in the E.g. the English stemmer's
If that used
Due to the limit, the string the among actually tests against is I don't really have any useful insights as to which to use when - I just end up trying both for each situation and seeing which works better.
Most of the stemmers work in terms of syllables. It does seem a better approach, though the difference it makes will depends on the language - if most words alternate vowel and consonant then it'll not make as much difference. Most languages use the same R1 definition as One of the first two is probably a good starting point unless there's something that Polish has in common with one of the other languages that's relevant here. The I notice though that the difference between the length threshold and the suffix length is not the same for all suffixes - it looks like it can be 1, 2 or 3. That may essentially be a bug, or actually make not difference in practice. The 1 vs 3 could be equivalent to R1 vs R2 (R2 is another syllable after R1). (Incidentally, |
We were removing part of the root here. Also cases other than nominative were not handled.
Updated once again. This time I also tried to solve problem that the original stemmer removed some d-suffix but did not support all possible i-suffices for it. In particular, I added possible declension forms for diminutives and for -owy adjectives. Also, I added support for more verbal forms and made the list of verbal forms better structured (e.g. all past forms are grouped together). The corresponding data changes are in snowballstem/snowball-data#30, each one in a separate commit to make it easier to review. |
'ach' // plural locative | ||
'ami' // plural instrumental | ||
'nia' // genitive of words ending with -nie, e.g. czytanie - czytania | ||
'niu' // dative/locative of words ending with -nie, e.g. czytanie - czytaniu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
czytaniu would actually be handled by 'aniu'
on line 62.
'niu' // dative/locative of words ending with -nie, e.g. czytanie - czytaniu | ||
'cia' // genitive of words ending with -cie, e.g. ukrycie - ukrycia | ||
'ciu' // dative/locative of words ending with -cie, e.g. ukrycie - ukryciu | ||
'ce' // dative/locative of words ending with -ka, e.g. muzyka - muzyce |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice muzyka stems to muzyk while muzyce stems to muz - not necessary a problem, but if it's unexpected or more widespread this might be worth looking at.
'{ak}ca' // nominative singular (feminine) | ||
'{ak}cej' // genitive/dative/locative singular (feminine) | ||
'{ak}c{ak}' // accusative singular (feminine) | ||
'{ak}ce' // nominative/accusative singular (neuter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
czytające will actually get -ce removed by remove_nouns
then -ą
removed by remove_general_ends
, rather than by this rule. So this may be redundant, but perhaps -ące can be exposed by removal of an ending in a previous step.
'{ak}c' // contemporary adverbial participle (transgressive), e.g. czytając | ||
// different forms of active adjectival participle, e.g. czytający (one who is reading) | ||
'{ak}cy' // nominative singular (masculine) | ||
'{ak}cego' // genitive singular (masculine/neuter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
czytającego will actually get -ego removed by remove_adjective_ends
then -ąc
removed by the rule in this among
(line 149).
// different forms of active adjectival participle, e.g. czytający (one who is reading) | ||
'{ak}cy' // nominative singular (masculine) | ||
'{ak}cego' // genitive singular (masculine/neuter) | ||
'{ak}cemu' // dative singular (masculine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very similar to the -ego case just above.
'{ak}cemu' // dative singular (masculine) | ||
'{ak}cym' // instrumental/locative singular (masculine) | ||
'{ak}ca' // nominative singular (feminine) | ||
'{ak}cej' // genitive/dative/locative singular (feminine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also similar to the -ego case.
'{ak}cej' // genitive/dative/locative singular (feminine) | ||
'{ak}c{ak}' // accusative singular (feminine) | ||
'{ak}ce' // nominative/accusative singular (neuter) | ||
'{ak}cych' // genitive plural |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the -ego case.
'{ak}cymi' // instrumental plural | ||
(delete) | ||
'aj' // imperative, e.g. czytaj | ||
(<- 'a') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
czytając (example given further above) stems to czytaj, but czytaj gets the -j removed here and then -a removed by remove_general_ends
to give stem czyt.
They both seem to be forms of czytać and that also stems to czyt which kind of suggests maybe we ought to also be removing the aj from czytając?
Maybe that causes other problems though.
setlimit tomark p1 for ([substring]) among ( | ||
'{o'}w' // genitive plural, e.g. włos - włosów | ||
'om' // dative plural, e.g. włos - włosom | ||
'ami' // instrumental plural, e.g. włos - włosami |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-ami actually gets removed from włosami by remove_nouns
(noted as "plural instrumental" there). Maybe this is needed if -ami can be exposed by removing another suffix but we also need to remove -ami earlier to expose suffixes, but if so that kind of suggests the order of rules might not be quite right.
Thank you for the comments! Please don't wait for me if you want to get a new release out, I will fix them eventually. As I will need to rework it again anyway, I want to ask some questions myself first:
|
Sorry, I got rather swallowed up trying to make progress towards a new release.
Thanks - I am currently leaning towards getting a new release out (it's very nearly 3.5 years since the last one which is too long and there's a huge pile of unreleased useful changes), and then dealing with the pending new algorithms and making another release with those. As well as this there's Czech (#151) mostly needing documentation writing, Persian (#194) waiting for review, and Ukrainian (#144) waiting for the submitter.
If there's a clean structure to the suffixes in the language then reflecting that in the code makes sense, and eliminating checks for suffixes that can never be present seems a good idea - more efficient and eliminates the risk of removing something that looks like a suffix when it isn't. However you may find there are cases the suffix is removed but it's just how a word ends, and removing other suffixes incorrectly actually helps to conflate forms of a word together, albeit on a stem which isn't linguistically correct (which is OK for our purposes provided it doesn't collide with stems of unrelated words). This is quite common in English (which is really a mess of a language with a lot of loan words and the like from many other languages), but I've seen it in several others too. I saw a good example of this just the other day in an old mailing list post, but I can't now find it or think of a good example myself.
The key goal is to improve retrieval (without harming precision too much), so inflectional suffixes are probably the most critical. Diminutive suffices can change meaning enough that conflating or not are both reasonable. The linguistic structure may steer one to a conclusion which isn't fully supported by the meanings of the words - E.g. in English, a young dog is a "puppy" and not conflating seems the obvious choice, but in German a young "Hund" is a "Hündchen" - the structure of the words perhaps suggests conflating. The meanings are probably similarly different in the two languages (I'm not fluent enough in German to say for sure) and the difference is really that in English you'd need a special case, while in German you could conceivably have a -chen rule that also handled "Kätzchen", etc. -acj- is like English "ation", right? That seems more useful than diminutives at least. Not sure what -an- and -ow- mean.
Returning czyt as the stem is fine for our purposes, provided it doesn't result in conflating unrelated words. Thinking about English words ending "e", e.g. "pasting` comes from "paste" which has an entirely different meaning to "past". The stem for all of these was "past", but I've added an exception for the next release to address this. The English stemmer goes to some trouble to try to remove or re-add -e but was still getting this case wrong. |
Thank you for the detailed response! |
I took the stemmer which was originally written by @tomek-ai in #159 and rewrote it using substring checks.
The original commit is preserved (but rebased), my modifications are in separate commits for review purposes (please feel free to squash when merging). My first commit does not cause behavior changes, the subsequent commits do.
There is some room for improvement, e.g. the stemmer does not recognize all possible superlative forms (
naj-
), but it handles simpler cases quite well.The corresponding snowball-data PR is snowballstem/snowball-data#30.