Skip to content

Refactor Optimizer for correctness #2052

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

stevenyoungs
Copy link
Contributor

@stevenyoungs stevenyoungs commented Apr 25, 2025

When the filter optimizer combines the result of two or more sub-filters, it is currently using the logical_op of the right hand sub-filter. Fix this so that the optimizer uses the parent filter logical_op

Fixes #13799 https://gramps-project.org/bugs/view.php?id=13799
Related forum discussion: https://gramps.discourse.group/t/strange-filter-behavior-in-6-0-1/7505/16

To reproduce using the example Gramps tree

  1. create a new filter called "Ancestors of" with a single rule, "Ancestors of "
    set ID to I0044, Inclusive ticked
    leave the Options set to "all rules must apply"
  2. create a new filter called "Siblings of Ancestors" with a single rule, "Siblings of match"
    set the filter to "Ancestors of"
    leave the Options set to "all rules must apply"
  3. Create a new filter called "The Family and their Spouces" with two rules
    a. rule 1: People matching the and set the filter to "Ancestors of"
    b. rule 2: People matching the and set the filter to "Siblings of Ancestors"
    set the Options to "At least one rule must apply"

Filter the people view using the filter "Siblings of Ancestors". 25 people are expected. 0 are shown

The filter "Ancestors of" returns 7 people
The filter "Siblings of Ancestors" returns 18 people
So the union of these two results must, at a minimum, return 18 people. Therefore a result of 0 people cannot be correct.

@stevenyoungs
Copy link
Contributor Author

@dsblank I think this is the correct fix but would appreciate your review

@dsblank
Copy link
Member

dsblank commented Apr 25, 2025

Fantastic! I'll take a look.

@Nick-Hall Nick-Hall added the bug label Apr 25, 2025
@dsblank
Copy link
Member

dsblank commented Apr 25, 2025

Here are the filters in XML:

<?xml version="1.0" encoding="utf-8"?>
<filters>
  <object type="Person">
    <filter name="Ancestors of" function="and">
      <rule class="IsAncestorOf" use_regex="False" use_case="False">
        <arg value="I0044"/>
        <arg value="1"/>
      </rule>
    </filter>
    <filter name="Family and their Spouses" function="or">
      <rule class="MatchesFilter" use_regex="False" use_case="False">
        <arg value="Ancestors of"/>
      </rule>
      <rule class="MatchesFilter" use_regex="False" use_case="False">
        <arg value="Siblings of Ancestors"/>
      </rule>
    </filter>
    <filter name="Siblings of Ancestors" function="and">
      <rule class="IsSiblingOfFilterMatch" use_regex="False" use_case="False">
        <arg value="Ancestors of"/>
      </rule>
    </filter>
  </object>
</filters>

Planning on making this a test in gramps.

@Nick-Hall
Copy link
Member

We should include a unit test with a fix like this.

@dsblank
Copy link
Member

dsblank commented Apr 25, 2025

@stevenyoungs I couldn't figure out how to make a PR on your PR. So here is a diff for adding a unittest for the above. It passes after your fix!

diff --git a/gramps/gen/filters/test/__init__.py b/gramps/gen/filters/test/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/gramps/gen/filters/test/filter_optimizer_test.py b/gramps/gen/filters/test/filter_optimizer_test.py
new file mode 100644
index 000000000..3fc0df3ca
--- /dev/null
+++ b/gramps/gen/filters/test/filter_optimizer_test.py
@@ -0,0 +1,76 @@
+from gramps.gen.filters import FilterList
+import tempfile
+import os
+import unittest
+
+from ...const import DATA_DIR
+from ...db.utils import import_as_dict
+from ...user import User
+from ...filters import reload_custom_filters
+
+reload_custom_filters()
+
+from ...filters import CustomFilters
+
+
+custom_filters_xml = """<?xml version="1.0" encoding="utf-8"?>
+<filters>
+  <object type="Person">
+    <filter name="Ancestors of" function="and">
+      <rule class="IsAncestorOf" use_regex="False" use_case="False">
+        <arg value="I0044"/>
+        <arg value="1"/>
+      </rule>
+    </filter>
+    <filter name="Family and their Spouses" function="or">
+      <rule class="MatchesFilter" use_regex="False" use_case="False">
+        <arg value="Ancestors of"/>
+      </rule>
+      <rule class="MatchesFilter" use_regex="False" use_case="False">
+        <arg value="Siblings of Ancestors"/>
+      </rule>
+    </filter>
+    <filter name="Siblings of Ancestors" function="and">
+      <rule class="IsSiblingOfFilterMatch" use_regex="False" use_case="False">
+        <arg value="Ancestors of"/>
+      </rule>
+    </filter>
+  </object>
+</filters>
+"""
+TEST_DIR = os.path.abspath(os.path.join(DATA_DIR, "tests"))
+EXAMPLE = os.path.join(TEST_DIR, "example.gramps")
+
+class OptimizerTest(unittest.TestCase):
+    @classmethod
+    def setUpClass(cls):
+        """
+        Import example database.
+        """
+        cls.db = import_as_dict(EXAMPLE, User())
+
+        with tempfile.NamedTemporaryFile(mode='w+', encoding='utf8') as tmp_file:
+            tmp_file.write(custom_filters_xml)
+            tmp_file.seek(0)
+            fl = FilterList(tmp_file.name)
+            fl.load()
+
+        the_custom_filters = CustomFilters.get_filters_dict("Person")
+        cls.filters = fl.get_filters_dict("Person")
+        for filter_name in cls.filters:
+            the_custom_filters[filter_name] = cls.filters[filter_name]
+
+    def test_ancestors_of(self):
+        filter = self.filters["Ancestors of"]
+        results = filter.apply(self.db)
+        self.assertEqual(len(results), 7)
+
+    def test_siblings_of_ancestors(self):
+        filter = self.filters["Siblings of Ancestors"]
+        results = filter.apply(self.db)
+        self.assertEqual(len(results), 18)
+
+    def test_family_and_their_spouses(self):
+        filter = self.filters["Family and their Spouses"]
+        results = filter.apply(self.db)
+        self.assertEqual(len(results), 25)

Let me know if there is way to do this appropriately.

Copy link
Member

@dsblank dsblank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice; thanks!

@stevenyoungs stevenyoungs force-pushed the filter-optimizer-logical_op branch from f85513f to 5d7537f Compare April 26, 2025 13:12
@stevenyoungs
Copy link
Contributor Author

hhmmmm I'm not sure why other tests are suddenly failing. I'm glad I made you co-author now @dsblank 🤣

@dsblank
Copy link
Member

dsblank commented Apr 26, 2025

I'll take a look :)

@dsblank
Copy link
Member

dsblank commented Apr 26, 2025

It is probably because I altered the global CustomFilters. Let me see if I can cleanup after the new tests.

@dsblank
Copy link
Member

dsblank commented Apr 26, 2025

Definitely caused by my test. Attempting to fix...

@dsblank
Copy link
Member

dsblank commented Apr 26, 2025

Okay, I think I have made the pytest and Gramps gods happy.

@Nick-Hall
Copy link
Member

Please add entries in the po/POTFILES.skip file for the new files.

@stevenyoungs
Copy link
Contributor Author

Please add entries in the po/POTFILES.skip file for the new files.

Done.

@kkujansuu
Copy link
Contributor

Sorry to intrude but as far as I can see filtering still does not work in Gramps 6 even with this optimizer fix. See for example the filters below.

The filter F1 obviously should yield all individuals in the database (since it is an "or" filter with the "Everyone" rule).
But it only yields the same results as filter F3:

  <object type="Person">
    <filter name="F1" function="or">
      <rule class="Everyone" use_regex="False" use_case="False">
      </rule>
      <rule class="MatchesFilter" use_regex="False" use_case="False">
        <arg value="F2"/>
      </rule>
    </filter>
    <filter name="F2" function="and">
      <rule class="MatchesFilter" use_regex="False" use_case="False">
        <arg value="F3"/>
      </rule>
    </filter>
    <filter name="F3" function="and">
      <rule class="IsAncestorOf" use_regex="False" use_case="False">
        <arg value="I0041"/>
        <arg value="0"/>
      </rule>
    </filter>
  </object>

test

Or have I misunderstood something?

@dsblank
Copy link
Member

dsblank commented Apr 28, 2025

This looks like another test to add to the test suite. Thanks!

@dsblank
Copy link
Member

dsblank commented Apr 28, 2025

Added @kkujansuu 's additional issue as another set of unittests. Now we have one failure to fix in the optimizer.

@stevenyoungs
Copy link
Contributor Author

Here's a tabular version of the nice diagram @kkujansuu created. It includes additional information on the rules which have a selected_handles attribute.

name rule logical_op optimizer
F3 Ancestors of "I0041" and selected_handles
F2 People matching the "F3" and
F1 Everyone
People matching the "F2"
or

The filter optimizer walks the tree of filters, looking for rules where

  1. there is a selected_handles attribute
  2. the logical_op is "and"

Where both conditions are true, the selected_handles are used to reduce the search space for any other, non-optimized rules. It reduces the search space by limiting handles which are checked in subsequent rules to those in selected_handles (handles_in) or for inverted rules, testing all handles in the db, excluding those handles in selected_handles (handles_out).

Earlier in this PR, I modified test 2 (logical_op == 'and') to check the logical_op of the parent filter. The parent filter is F2 in this scenario. As can be seen, this also has a logical_op of 'and'.

From the table above, we can see that the rule in filter F3 meets both conditions (selected_handles and logical_op=='and' [or parent_logical_op=='and']). The optimizer therefore only considers the 16 handles resulting from filter F2 when evaluating the "Everyone" rule. Whilst this is quicker, it is also, unfortunately, wrong.

Alas, I'm still not sure what the correct fix is....

@dsblank
Copy link
Member

dsblank commented Apr 29, 2025

I think the issue is actually at a higher level: at the first level there is an "OR" (everything or the other branch). When there is such an "OR" we should not use the optimizer's selected_handles at this level or any where below.

@dsblank
Copy link
Member

dsblank commented Apr 29, 2025

@stevenyoungs, I made a change cc8bc77 that I believe fixes the issues, but not 100% sure yet. Need further testing. I think we can make up a tree of AND's and OR's that still fails. More investigation needed...

@dsblank
Copy link
Member

dsblank commented Apr 29, 2025

I don't think the failing tests have anything to do with these changes.

@stevenyoungs
Copy link
Contributor Author

@stevenyoungs, I made a change cc8bc77 that I believe fixes the issues, but not 100% sure yet.

Like you, I'm not entirely confident, but it feels wrong. parent_logical_op is now always going to be the logical_op of the root (top-most) filter. i.e. constant for a given execution of Optimizer. So every entry appended to result will have the same logical_op.
get_handles, which tests the value of logical_op for each value in result (now called all_selected_handles) will consider everything as "and" or as "or" (because parent_logical_op is always the same value). so if one of the lower filters has a logical_op of "or", I believe it would get treated by get_handles as "and".

My assumption is that this works for @kkujansuu's example because the root filter is "or". If there was an F0 filter added on top, which used "and", my expectation is that the wrong result will be generated.

I think we can make up a tree of AND's and OR's that still fails.

I believe so too. I think we need a multi-level filter, where the filter on each level has multiple rules. We then alternate these filters to use "and" or "or". Then repeat with each filter using "or" and repeat again with each filter using "and". Then for completeness, repeat with negation. @kkujansuu's example leads us in this direction.

@kkujansuu
Copy link
Contributor

How about rewriting the optimizer completely and replacing it with these two functions that call each other recursively:

   compute_potential_handles_for_filter(filt) -> set[str]
   compute_potential_handles_for_rule(rule) -> set[str]

High level pseudocode (that does not handle "invert") could be something like:

    def compute_potential_handles_for_filter(filt):
        if len(filt.flist) == 0: return ALL_HANDLES
        handlesets = [compute_potential_handles_for_rule(rule) for rule in filt.flist]
        if filt.logical_op == "and":
            handles = intersection(handlesets)
        if filt.logical_op in ("or", "one"):
            handles = union(handlesets)
        return handles

    def compute_potential_handles_for_rule(rule):
        if hasattr(rule, "selected_handles"):
            return set(rule.selected_handles)
        if hasattr(rule, "find_filter"):
            rule_filter = rule.find_filter()
            if rule_filter is not None:
                return compute_potential_handles_for_filter(rule_filter)
        return ALL_HANDLES

GenericFilter would call compute_potential_handles_for_filter.

ALL_HANDLES would be the set of all handles in the current namespace.

Still todo: add "invert" handling and replace ALL_HANDLES with something more efficient.

Could something like this work?

@dsblank
Copy link
Member

dsblank commented Apr 30, 2025

Could something like this work?

Yes, I think so. The thing I missed in my first version was the ability to short-circuit any further processing when encountering an "or" at any level.

@stevenyoungs stevenyoungs force-pushed the filter-optimizer-logical_op branch from cc8bc77 to 177ddb5 Compare May 1, 2025 07:23
@stevenyoungs
Copy link
Contributor Author

I've taken @kkujansuu's idea and implemented. Initial results look promising.
His example filters from #2052 (comment) now produce the correct results.

It does not yet handle inverted filters, therefore tests are expected to fail.

@stevenyoungs stevenyoungs force-pushed the filter-optimizer-logical_op branch 2 times, most recently from e5da87e to 3ac68e7 Compare May 1, 2025 07:39
@stevenyoungs
Copy link
Contributor Author

Implemented support for inverted filters.

I'm not sure that "one" is handled correctly. I think it should use symmetric_difference() rather than union. union will give us results where more than one rule match

@stevenyoungs
Copy link
Contributor Author

selected_handles are exactly the handles that match the rule (given a specific context)

Good, I think that makes more sense.
The comment at the top of Optimizer.get_handles (on the maintenance\gramps60 branch) talks about handles_in being "a superset of the items that will match"
Given this change, I then agree with:

Further rules (in a specific context) can only make that set smaller.
so long as logical_op is "and" and not inverted.

I'm not sure I follow about why you think there could be an issue with the current version. Can you come up with a filter that shows what you mean?

My previous comment overlapped your commit. I'll take a look at your most recent commits now.

@DaveSch-gramps
Copy link
Contributor

Ran a complex filter and the new changes worked.

I ran the same filter again with the change from intersection to union and the exact same results were returned.

@dsblank
Copy link
Member

dsblank commented May 27, 2025

Thanks @DaveSch-gramps! With the last fix, and more tests, I think we finally have all cases covered.

@stevenyoungs
Copy link
Contributor Author

Thanks @DaveSch-gramps! With the last fix, and more tests, I think we finally have all cases covered.

Brave words given the history of this bug and PR - but I'm also feeling optimistic now.

@dsblank
Copy link
Member

dsblank commented May 29, 2025

I do feel much more brave about this implementation :) The code is now easier to understand (and maintain) and we have many more tests.

@stevenyoungs
Copy link
Contributor Author

test_not_i0001_and_i0002 uses the filter "not (I0001 and I0002)" which in turn requires two sub-filters "I0001" and "I0002"
These sub-filters did not exist, yet the unit tests were still passing. Messages were generated in the build output

Can't find filter I0001 in the defined custom filters
Can't find filter I0002 in the defined custom filters

I've fixed the immediate problem by defining the filters "I0001" and "I0002". In my view the unit test should also have failed.

@dsblank
Copy link
Member

dsblank commented Jun 1, 2025

I've fixed the immediate problem by defining the filters "I0001" and "I0002". In my view the unit test should also have failed.

Good catch. Looking into how the tests could have passed...

@dsblank
Copy link
Member

dsblank commented Jun 1, 2025

Looking into how the tests could have passed...

I see... it returns all handles, which is correct. I'll add more tests where "all" doesn't necessarily give a good answer.

@dsblank
Copy link
Member

dsblank commented Jun 1, 2025

@stevenyoungs if you test with either #2071 or #2062 you'll see that 12 tests fail. I like the idea, but we should probably wait for 6.1 to add features to this PR.

@stevenyoungs stevenyoungs force-pushed the filter-optimizer-logical_op branch 2 times, most recently from 018ffcb to 26ad7a7 Compare June 1, 2025 21:03
@stevenyoungs
Copy link
Contributor Author

stevenyoungs commented Jun 1, 2025

@stevenyoungs if you test with either #2071 or #2062 you'll see that 12 tests fail. I like the idea, but we should probably wait for 6.1 to add features to this PR.

Good find. I've added a test that catches this scenario. The build has just failed, due to this test, so it is working.
Now I'll reorder the commits so we have the test first.

Edit to add, I used "parents of" as it was the simplest optimized rule I could find that is already in the gramps60 branch.
I noticed that IsParentOfFilterMatch.prepare can be further optimised if the sub-filter itself is optimised. If so, it can use the selected_handles that the sub-filter has already computed rather than iterate over all people in the db.
A second optimisation: a rules prepare method typically works over all handles in the db. This could be wasteful if GenericFilter.apply_logical_op_to_all is given a subset of handles to filter (via the possible_handles argument). Only these handles need to be prepared by the rule.
Both optimisations are for another day and another PR.

@stevenyoungs stevenyoungs force-pushed the filter-optimizer-logical_op branch from 26ad7a7 to 4325eaa Compare June 1, 2025 21:14
@Nick-Hall
Copy link
Member

I like the idea, but we should probably wait for 6.1 to add features to this PR.

Yes. Please don't include new features into fixes in maintenance branches.

@stevenyoungs
Copy link
Contributor Author

stevenyoungs commented Jun 1, 2025

@stevenyoungs if you test with either #2071 or #2062 you'll see that 12 tests fail. I like the idea, but we should probably wait for 6.1 to add features to this PR.

I backed out the optimization of "or" and "one" in such a way that it is simple to enable in master, when the time is right.
Alternatively, @Nick-Hall, just drop the last two commits from this branch: 4325eaa and fb926b1 and I can put them on a new PR against main

@dsblank dsblank added this to the v6.0 milestone Jun 7, 2025
@dsblank dsblank requested a review from Copilot June 9, 2025 18:19
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the filter optimizer to ensure that when combining the results of sub‑filters the logical operator of the parent filter is used instead of that of the right-hand sub‑filter. Key changes include the addition of new test files, modifications to filter rule tests (including tree support), and comprehensive updates to the optimizer and generic filter implementations to correct the operator usage.

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
po/POTFILES.skip Added entries to skip new test directories for generated tests.
gramps/plugins/test/reports_test.py Updated assertions to include error outputs for better debugging.
gramps/gen/filters/rules/test/place_rules_test.py Added tree parameter support to the filter test method.
gramps/gen/filters/rules/person/_isdescendantfamilyof.py Added conditional check prior to adding spouse_handle to results.
gramps/gen/filters/optimizer.py Refactored filter optimizer logic and operator handling.
gramps/gen/filters/_genericfilter.py Updated filter application logic and docstrings to reflect changes.
gramps/gen/filters/_filterlist.py Added loadString method for loading custom filters from strings.
gramps/gen/filters/init.py Added helper to set the global custom filters.
Comments suppressed due to low confidence (1)

gramps/gen/filters/_genericfilter.py:175

  • Since the Optimizer is now instantiated without a filter parameter and expects the filter to be passed explicitly in compute_potential_handles_for_filter, please update the documentation to clarify this change.
optimizer = Optimizer()

Refactored using simplified logic.
Added unit tests.

Fixes #13799.

Co-authored-by: Doug Blank <[email protected]>
@Nick-Hall Nick-Hall force-pushed the filter-optimizer-logical_op branch from 2911f8d to be30d5e Compare June 9, 2025 21:27
@Nick-Hall Nick-Hall dismissed their stale review June 9, 2025 21:36

This looks good now.

@Nick-Hall Nick-Hall merged commit be30d5e into gramps-project:maintenance/gramps60 Jun 9, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants