
Introduce masking class and incorporate in TokenizerMasking #383


Merged
merged 29 commits into develop from shmh40/dev/masking_class on Jun 28, 2025

Conversation

@shmh40 (Contributor) commented Jun 24, 2025

Description

WIP: create and instantiate a Masker class that implements masking strategies and is called from TokenizerMasking to produce masked source and target tokens. The class should allow different masking strategies (e.g. random, per-healpix-cell, inpainting) and should ensure that the source and target masks are properly aligned. A rough interface sketch is included after the questions below.

Current questions:

  1. The Masker returns the masked source_data and the mask perm_sel. This perm_sel is then used in tokenizer_masking, outside the Masker. Perhaps both the source data and the target data should be produced by this Masker class? We could, in Masker.mask, use ~perm_sel to produce the masked target data when a target is requested.
  2. I haven't yet worked out how this should interact with different streams, if we want to mask a whole stream and predict it from another stream.
  3. The second masking strategy, "block", is just a placeholder; ideally it would be implemented with healpix cells. I still need to work out how to use the healpix cells in this Masker class: just pass them in (produced in TokenizerMasking?) and use them?
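For context, here is a minimal sketch of the interface being discussed; the constructor signature, method names, and per-cell handling are assumptions for illustration, not the code in this PR:

```python
import numpy as np
import torch


class Masker:
    """Generate per-token masks and apply them to source and target tokens."""

    def __init__(
        self,
        masking_rate: float,
        masking_strategy: str = "random",
        rng: np.random.Generator | None = None,
    ):
        self.masking_rate = masking_rate
        self.masking_strategy = masking_strategy
        self.rng = rng if rng is not None else np.random.default_rng()
        self.perm_sel: list[torch.Tensor] | None = None  # kept so source/target stay aligned

    def mask_source(self, tokenized_data: list[torch.Tensor]) -> list[torch.Tensor]:
        # One boolean entry per token; True means "removed from the source".
        self.perm_sel = [
            torch.from_numpy(self.rng.uniform(0.0, 1.0, len(cell)) < self.masking_rate)
            for cell in tokenized_data
        ]
        return [cell[~sel] for cell, sel in zip(tokenized_data, self.perm_sel)]

    def mask_target(self, tokenized_data: list[torch.Tensor]) -> list[torch.Tensor]:
        # The complement of the source mask, so no token appears in both.
        assert self.perm_sel is not None, "mask_source must be called before mask_target"
        return [cell[sel] for cell, sel in zip(tokenized_data, self.perm_sel)]
```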

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Issue Number

Resolves #380
Resolves #408

Code Compatibility

  • I have performed a self-review of my code

Code Performance and Testing

  • I ran uv run train and (if necessary) uv run evaluate on at least one GPU node and it works
  • If the new feature introduces modifications at the config level, I have made sure to notify the other software developers through Mattermost and to update the paths in the $WEATHER_GENERATOR_PRIVATE directory

Dependencies

  • I have ensured that the code is still pip-installable after the changes and runs
  • I have tested that new dependencies themselves are pip-installable.
  • I have not introduced new dependencies in the inference portion of the pipeline

Documentation

  • My code follows the style guidelines of this project
  • I have updated the documentation and docstrings to reflect the changes
  • I have added comments to my code, particularly in hard-to-understand areas

Additional Notes

@shmh40 shmh40 requested a review from clessig June 24, 2025 13:52
@shmh40 shmh40 self-assigned this Jun 24, 2025
@shmh40 shmh40 added the enhancement label (New feature or request) Jun 24, 2025
@tjhunter (Collaborator) left a comment:

a few style comments

@shmh40 shmh40 marked this pull request as draft June 24, 2025 14:54
@clessig (Collaborator) left a comment:

Thanks for the draft. Let's try to keep it as much as possible as a refactor and introduce new features later; same for some fixes that are not directly related.

The Masker should encapsulate the masking as much as possible.

def mask_source(
    self,
    tokenized_data: list[torch.Tensor],
    rng: np.random.Generator,
Collaborator:
This should be part of the state, potentially passed in the constructor

@shmh40 (Contributor, author):
Agreed thank you - done with new commit

    self,
    tokenized_data: list[torch.Tensor],
    rng: np.random.Generator,
    masking_rate: float,
Collaborator:
It's passed in the constructor. Why is it passed here again? Any sampling should also happen in this class.

@shmh40 (Contributor, author):
Agreed



class Masker:
    """Class to generate boolean masks for token sequences and apply them.
Collaborator:
I think the class can also be used for BERT-type masking + noising.

@shmh40 (Contributor, author) commented Jun 25, 2025:
Updated the docstring, removed "boolean".

if num_tokens == 0:
    return tokenized_data, []

# Determine the masking rate to use for this call
Collaborator:
Remove, see above

@shmh40 (Contributor, author):
Done


if self.masking_rate_sampling:
    rate = np.clip(
        np.abs(rng.normal(loc=rate, scale=1.0 / (2.5 * np.pi))),
Collaborator:
We should parametrize this. But better to do it in a separate PR
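For illustration, such a parametrization might look like the sketch below; the helper name sample_masking_rate and the scale parameter are assumptions, not part of this PR:

```python
import numpy as np

# Hypothetical sketch: lift the hard-coded 1.0 / (2.5 * np.pi) into a parameter.
DEFAULT_RATE_SCALE = 1.0 / (2.5 * np.pi)


def sample_masking_rate(
    rng: np.random.Generator, base_rate: float, scale: float = DEFAULT_RATE_SCALE
) -> float:
    """Draw a per-sample masking rate around base_rate, clipped to [0, 1]."""
    return float(np.clip(np.abs(rng.normal(loc=base_rate, scale=scale)), 0.0, 1.0))


rate = sample_masking_rate(np.random.default_rng(0), base_rate=0.6)
```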

@@ -336,6 +339,8 @@ def __iter__(self):

(ss_cells, ss_lens, ss_centroids) = self.tokenizer.batchify_source(
    stream_info,
    # NOTE: two unused arguments in TokenizerMasking,
    # still used in TokenizerForecast?
    self.masking_rate,
Collaborator:
No, they should be removed

@@ -178,6 +178,7 @@ def tokenize_window_space(

if len(source) < 2:
    return

# idx_ord_lens is length...
Collaborator:
idx_ord_lens is the number of tokens per healpix cell

)
for cc, pp in zip(target_tokens_cells, self.perm_sel, strict=True)
]
######################
Collaborator:
Remove before we merge

@@ -280,7 +278,6 @@ def id(arg):
)

# tokenize
# TODO: properly set stream_id; don't forget to normalize
Collaborator:
This still needs to be done

mask[self.rng.integers(low=0, high=len(mask))] = False
# if masking rate is 1.0, all tokens are masked, so the source is empty
# but we must compute perm_sel for the target function
if masking_rate == 1.0:
Collaborator:
This case should be handled in the Masker and not here
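For illustration, one way the edge case could live inside the Masker; the helper name generate_source_mask and the return convention are assumptions:

```python
import numpy as np


def generate_source_mask(rng: np.random.Generator, num_tokens: int, rate: float) -> np.ndarray:
    """Return a boolean mask; True marks tokens removed from the source."""
    if rate >= 1.0:
        # Everything is masked: the source ends up empty, but the full mask is
        # still returned so perm_sel remains valid for building the target.
        return np.ones(num_tokens, dtype=bool)
    return rng.uniform(0.0, 1.0, num_tokens) < rate
```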

@clessig clessig moved this to In Progress in WeatherGen-dev Jun 24, 2025
@clessig (Collaborator) left a comment:

Just some minor comments. If it has been tested, then it's good to be merged.

):
    self.masking_rate = masking_rate
    self.masking_strategy = masking_strategy
    # self.masking_combination = masking_combination
Collaborator:
Can we remove this line before we merge?

# Initialize the random number generator.
worker_info = torch.utils.data.get_worker_info()
div_factor = (worker_info.id + 1) if worker_info is not None else 1
self.rng = np.random.default_rng(int(time.time() / div_factor))
Collaborator:
The rng seed should be passed from the cf.seed, to ensure we can reproduce maskings if we want, while ensuring it is different for the different parallel workers.

I would suggest keeping it as is so as not to overload the PR and, once merged, opening an issue to address the problem.
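One possible shape for that follow-up issue; the use of np.random.SeedSequence and the helper name make_worker_rng are assumptions, not an agreed design:

```python
import numpy as np
import torch


def make_worker_rng(base_seed: int) -> np.random.Generator:
    """Derive a reproducible, per-worker generator from a single config seed (e.g. cf.seed)."""
    worker_info = torch.utils.data.get_worker_info()
    worker_id = worker_info.id if worker_info is not None else 0
    # spawn() yields statistically independent child streams per worker.
    children = np.random.SeedSequence(base_seed).spawn(worker_id + 1)
    return np.random.default_rng(children[worker_id])
```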

)

# Handle the special case where all tokens are masked
# NOTE: not going to handle different streams correctly.
Collaborator:
I don't fully understand the comment. Can you please elaborate.

@shmh40 (Contributor, author):
Removed, my misunderstanding

if self.masking_strategy == "random":
    flat_mask = self.rng.uniform(0, 1, num_tokens) < rate

elif self.masking_strategy == "block":
Collaborator:
Have you visualized this and checked correctness?

@shmh40 (Contributor, author):
Yes, visualised and looks as expected. Just a placeholder for now. Healpix in next PR.
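Purely as an illustration of a contiguous-block placeholder (an assumed sketch, not the PR's actual code; the healpix-cell version is deferred to a follow-up PR):

```python
import numpy as np


def block_mask(rng: np.random.Generator, num_tokens: int, rate: float) -> np.ndarray:
    """Mask one contiguous run of tokens covering roughly `rate` of the sequence."""
    block_len = int(round(rate * num_tokens))
    start = int(rng.integers(0, num_tokens - block_len + 1)) if block_len < num_tokens else 0
    mask = np.zeros(num_tokens, dtype=bool)
    mask[start : start + block_len] = True
    return mask
```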

"""

# check that self.perm_sel is set with an assert statement
assert hasattr(self, 'perm_sel'), "Masker.perm_sel must be set (in mask_source) before calling mask_target."
Collaborator:
You set it in the constructor (to None), so the hasattr will always be true. You need to test for not-None.
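The suggested check would look roughly like this (a sketch of the reviewer's point, not the merged code):

```python
# perm_sel is initialized to None in __init__, so hasattr() is always true;
# test the value instead.
assert self.perm_sel is not None, (
    "Masker.perm_sel must be set (in mask_source) before calling mask_target."
)
```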

@shmh40 shmh40 marked this pull request as ready for review June 27, 2025 15:25
@shmh40 (Contributor, author) commented Jun 27, 2025:

Note, this PR does introduce a new config parameter, and hence other developers will be notified.

# include a masking strategy here, currently only supporting "random" and "block"
masking_strategy: "random"

@clessig clessig merged commit 6f831c3 into develop Jun 28, 2025
3 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in WeatherGen-dev Jun 28, 2025
@clessig clessig deleted the shmh40/dev/masking_class branch June 28, 2025 12:31
Labels: enhancement (New feature or request)
Projects: WeatherGen-dev (Status: Done)
Development: successfully merging this pull request may close these issues:
  • RNGs in tokenizers are not properly initialized
  • Masking class and separation of tokenizer_masking
3 participants