
Optionally disable mapset locking and cleaning #602


Draft
wants to merge 8 commits into main from optional-clean-and-lock

Conversation

wenzeslaus
Member

@wenzeslaus wenzeslaus commented May 8, 2020

This makes it possible to skip locking of a mapset, which, in combination with skipping the cleaning of the mapset, allows multiple jobs to be executed in one mapset using --exec. This should be the same as running multiple jobs together right in a single session (having the same limitations, but also not touching any general files).

To account for the use case where the mapset is not used for any interactive session, a separate flag for cleaning only is provided for convenience (used the same way as something like --exec g.region -p).

Similarly to --tmp-location and --tmp-mapset, usage of --no-lock and --no-clean without --exec is not disallowed, but it is not documented either until its usefulness is evaluated. One use case is starting an interactive session with --no-lock --no-clean to examine the results of jobs running with these flags.

Combining the new flags with the old ones, and with each other, is not limited except for a few cases which clearly exclude each other (--clean-only means cleaning and nothing else, so combining it with --no-clean makes no sense). Although --no-lock and --no-clean are meant to be used together, saying what the correct or allowed use is remains hard, because the purpose of these flags is to disable the standard session behavior.

Additional changes:

  • Use atexit to call the mapset cleaning functions, making the cleanup more organized and thus easier to disable.
  • Rename and document related functions.
  • Extract gis env (rc) saving from the cleanup code because it is not related (the old function was called clean_env, so it seemed related).
  • If --tmp-mapset is used with --gui or --text (an allowed but undocumented combination), do not store the last used gis env vars (gisrc). This was missing from 9da94b7 (init: Add --tmp-mapset option #313).

This is a much better version of the (not closed) Trac ticket 2685, which suggested a flag to ignore the lock. This achieves the same more formally by not creating the unwanted lock in the first place, which is a standard situation on Windows and when managing the session manually, directly through environmental variables. Additionally, this provides an important addition to avoid removal of temporary files of another session during cleanup, and a convenient way to do that cleanup afterwards.

As for the importance of --no-clean: as far as I understand, clean_temp, the program responsible for removing the files, looks at the PID of the process which supposedly created the file (based on the file name) and, if the process is still running, it does not remove the file. (For other files it looks at the date, and it works within the machine's/node's subdirectory.) However, in my test I do get some problems (maps not created) and warnings from clean_temp: WARNING: Can't stat file ...: No such file or directory, skipping. Additionally, the current cleanup at the end of the session vacuums SQLite, so multiple vacuums are requested. (Perhaps this is only a moderate issue because parallel jobs running in one mapset can practically use a single SQLite database only for reading.) The test follows.

grass ~/grassdata/nc_spm/test1 -c -e
for i in `seq 1 1000`
do
    grass ~/grassdata/nc_spm/test1/ --no-lock --exec r.mapcalc "e$i = sin(rand(0, 100))" -s &
done
wait
grass ~/grassdata/nc_spm/test1/ --exec g.list type=raster mapset=. | wc

grass ~/grassdata/nc_spm/test2/ -c -e
for i in `seq 1 1000`
do
    grass ~/grassdata/nc_spm/test2/ --no-lock --no-clean --exec r.mapcalc "e$i = sin(rand(0, 100))" -s &
done
wait
grass ~/grassdata/nc_spm/test2/ --exec g.list type=raster mapset=. | wc
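The liveness check that clean_temp reportedly performs can be sketched roughly as follows. This is an illustration of the described behavior only, not the actual implementation; the helper names and the PID-as-file-suffix convention are made up:

```shell
# Keep a temp file while the process whose PID is encoded in its name
# is alive; remove it otherwise. (Names and the PID-suffix convention
# are illustrative only, not clean_temp's real code.)
is_alive() {
    kill -0 "$1" 2>/dev/null    # signal 0 only tests for process existence
}
maybe_remove() {
    file="$1"
    pid="${file##*.}"           # assume the PID is the last dot-suffix
    if is_alive "$pid"; then
        echo "keeping $file: process $pid is still running"
    else
        rm -f -- "$file"
        echo "removed $file"
    fi
}
```

With this behavior, a file from a still-running parallel job is kept, but any hiccup in determining the PID (or a finished job's file) leads to removal, which matches the "Can't stat file" warnings seen in the test.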

This should improve parallel workflows, e.g. with GNU parallel or on HPC. Expected usage:

# prepare data
grass ~/grassdata/nc_spm/landsat/ ...
# somehow get these running in parallel
grass ~/grassdata/nc_spm/landsat/ --no-lock --no-clean --exec g.long.computation ...
grass ~/grassdata/nc_spm/landsat/ --no-lock --no-clean --exec g.long.computation ...
grass ~/grassdata/nc_spm/landsat/ --clean-only
# evaluate the results
# ...
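To make the "somehow get these running in parallel" step concrete, here is a hedged sketch. g.long.computation is the placeholder name from the example above, and the grass commands are only printed, not executed, so the pattern can be tried without a GRASS setup:

```shell
# With GNU parallel, the fan-out could be a single line:
#
#   parallel grass ~/grassdata/nc_spm/landsat/ --no-lock --no-clean \
#       --exec g.long.computation input=tile_{} ::: 1 2 3 4
#
# A plain-shell equivalent backgrounds each job and waits for all of them.
# The command is echoed here instead of executed:
for i in 1 2 3 4; do
    echo grass ~/grassdata/nc_spm/landsat/ --no-lock --no-clean \
        --exec g.long.computation input="tile_$i" &
done
wait
```

On an HPC scheduler, each of these lines would instead be one submitted job; the shape of the grass ... --exec call stays the same.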

@neteler neteler changed the title optional clean and lock grass.py: optional clean and lock May 8, 2020
@neteler
Member

neteler commented May 8, 2020

(I took the liberty to make the PR title more understandable)

@wenzeslaus wenzeslaus changed the title grass.py: optional clean and lock Optionally disable mapset locking and cleaning May 8, 2020
@wenzeslaus wenzeslaus requested a review from neteler May 8, 2020 20:27
@wenzeslaus
Member Author

(I took the liberty to make the PR title more understandable)

Thanks. There was a reason why this was marked as draft. In fact, I'll keep it that way. It needs at least documentation. I'm not sure if I can create a test for this. However, you can go ahead and have a look.

@wenzeslaus wenzeslaus added the enhancement New feature or request label May 8, 2020
@neteler
Member

neteler commented May 8, 2020

(I took the liberty to make the PR title more understandable)

Thanks. There was a reason why this was marked as draft. In fact I'll keep it that way.

All good, but as far as I can see, later editing of the title doesn't make it into the changelog. Only writing the PR title right from the beginning will help (I edited it anyway, for the PR list). So, please, please use a reasonable title when opening the PR.

@wenzeslaus
Member Author

All good, but as far as I can see, later editing of the title doesn't make it into the changelog.

Sorry, I meant to include a smiley face in the message :-) But about the changelog: the PR title is in no way the final title/subject line for the commit. The commit message can and should be edited when doing Squash and merge, which is another advantage of Squash and merge, I think. Besides tailoring it towards what is actually being merged/committed at the end, you can spell-check it in your browser and clean up the commit messages from the individual commits on the branch, which are usually messy.

The problem is that GitHub's suggestion for the commit message title is based on the branch name, or on the commit message if there is only one commit. It does not seem to me that the PR title plays any role in that. See, e.g., #544, which has no PR title edit, but the suggested commit message (if you click Squash and merge) is a more raw version of the text coming from the commit. Hence, the only way to get good commit titles/subject lines for the changelog is editing them when doing Squash and merge, unless we want to resort to suggesting just one commit per branch/PR and getting it right the first time on the command line, like in the Subversion times.

@neteler
Member

neteler commented May 8, 2020

Hence, the only way to get good commit titles/subject lines for the changelog is editing them when doing Squash and merge

... ok, then let's all please do that. The creation of https://trac.osgeo.org/grass/wiki/Release/7.8.3-News took too much of my lifetime. But that's OT here :-)

For reviewing this PR, is it best to locally test it?

@wenzeslaus
Member Author

If anybody can think of some cases (use cases or, more importantly, potential failures) which I did not consider in the description, that would be great.

I did not figure out a useful way of testing this within the limited resources for automated tests in CI. The test in the description will take a lot of memory, but I had to use a lot of processes (1000 rather than 100) to be sure to get the errors without --no-clean consistently.

@mlennert
Contributor

Just out of curiosity: could you explain why this is needed? Mapset creation is so easy, and you can always access data from other mapsets, so I've always been able to avoid any locking issues by just working with temporary mapsets.

@wenzeslaus
Member Author

@mlennert I have added documentation which may answer some of your questions. As for the comparison with the workflow involving many temporary mapsets:

This is similar to the --tmp-mapset addition in that it tries to make some use cases easier. The standard workflow would be (you could collapse some steps; this is kind of a worst-case scenario):

# create mapsets
grass -c -e ~/grassdata/nc_spm/process1
grass -c -e ~/grassdata/nc_spm/process2
...
# give access to data in other mapsets when not using full path, set regions even if all the same
grass ~/grassdata/nc_spm/process1 --exec ...
grass ~/grassdata/nc_spm/process2 --exec ...
...
# somehow get these actual computations running in parallel
grass ~/grassdata/nc_spm/process1 --exec ...
grass ~/grassdata/nc_spm/process2 --exec ...
...
# create mapset for results, maybe get access to all the other mapsets
grass -c -e ~/grassdata/nc_spm/final_results
# evaluate the results
grass ~/grassdata/nc_spm/final_results --exec ...
# delete the mapsets if needed
rm -r ~/grassdata/nc_spm/process1
rm -r ~/grassdata/nc_spm/process2
...

Here, the only part which is repeated is the one which needs to be repeated:

# create mapset
grass -c -e ~/grassdata/nc_spm/computation
# set whatever is needed there
grass ~/grassdata/nc_spm/computation --exec ...
# somehow get these actual computations running in parallel
grass ~/grassdata/nc_spm/computation --no-lock --no-clean --exec ...
grass ~/grassdata/nc_spm/computation --no-lock --no-clean --exec ...
...
# evaluate the results
grass ~/grassdata/nc_spm/computation --exec ...

Isn't this more dangerous, less robust, and won't it fail for some workflows anyway (e.g., writing vector attributes to SQLite)? Yes, but even now people can do this 1) within one session or 2) when setting up the session manually.

  1. This PR just makes --exec the same as working within one session, so, e.g., your GNU parallel file is the same inside and outside of a session, with the only difference being the grass ... --exec part. No changes to your mapset handling just because you now need to run the processes from outside of the session (say, an interactive session on a laptop versus HPC or cloud).

  2. Especially before --exec, there was a lot of manual session setup. This works in a lot of environments, but I think it is a pain to set up, besides being error-prone. --exec has a lot of the same flexibility without doing any environment setup to make GRASS modules behave like normal commands, i.e., it makes GRASS behave kind of like programs with subcommands. However, some of the flexibility was not available, namely running processes in parallel within the same session/mapset. This PR adds the option to behave like a manually set up session, without locking and cleaning, but with a clear idea about what you are skipping and assurance that you are getting the setup right.

This partially relates to splitting a session into the runtime needed to get programs running and the connection to a mapset, or, more precisely, to separating the session concept into smaller pieces. grass ... --no-lock --no-clean --exec does all the runtime setup, but only a minimal mapset/database connection. Minimal in the sense that it is as if the (--exec'ed) module/script were running in a mapset, but nothing more happens in or to the mapset, i.e., nothing besides what the module does.

@wenzeslaus wenzeslaus marked this pull request as ready for review May 12, 2020 04:10
@neteler neteler requested a review from metzm May 12, 2020 05:21
This is already used for system tmp dir and gislock.

Renames and better documents the mapset cleaning functions.

Using Bash for the parallel and lock use case. It is a little simpler than using
multiprocessing, and even with that in place, it would be a good alternative test
in the future because it is a likely use case. It covers the no-locking part well.

The no-cleaning part is covered in the Python test using two different approximations
of a real case. The current cleaning functionality is tested as well, but not in a
standalone test case nor with --clean-only.

The most relevant command line combinations are tested in a separate test.
@wenzeslaus wenzeslaus force-pushed the optional-clean-and-lock branch from 142df27 to 3c3bf67 Compare May 12, 2020 18:10
@metzm
Contributor

metzm commented May 13, 2020

If you allow skipping the locking of a mapset in order to run multiple jobs in the same mapset at the same time, it is entirely up to the user to ensure that the multiple jobs do not interfere with each other, e.g., by overwriting each other's results. That means it is up to the user to design multiple parallel jobs correctly.
This raises the question again whether GRASS should be a fool-proof tool or whether experts can easily use it at their own risk. Consequently, if anything goes wrong, the user is wrong, not the software. That does not sound good to me.
Therefore I would prefer to keep the current locking mechanism.
For many years I have been executing multiple jobs in parallel in the same mapset in an HPC environment and never encountered any problems. I have not tested multiple write attempts to a mapset's SQLite db though.

@wenzeslaus
Member Author

If you allow skipping the locking of a mapset in order to run multiple jobs in the same mapset at the same time...

Anybody can run multiple jobs in one mapset even now. If you are on one machine and start one session, you can launch multiple jobs in it. The user doesn't even have to do any scripting: run one job from the command line and another from the GUI, or multiple ones from the GUI. There is nothing in GRASS GIS which stops the user from doing (or at least trying) this. The user can run parallel jobs with the --exec (old GRASS_BATCH_JOB) interface when supplying a parallel script. However, when the jobs are running through grass ... --exec, they cannot run in parallel in one mapset, unlike the jobs in all the other cases.

What this PR does is just optionally lift the parallel-execution restriction caused by mapset locking from one particular use of the --exec interface.

...it is entirely up to the user to ensure that the multiple jobs do not interfere with each other, e.g., by overwriting each other's results. That means it is up to the user to design multiple parallel jobs correctly.

Designing parallel jobs is complex, and users need at least some understanding to run them successfully. So, although I would agree that it is not generally desirable to have more ways for users to shoot themselves in the foot, in this case I think it is more desirable to remove obstacles for those who know what they are doing.

This raises the question again whether GRASS should be a fool-proof tool or whether experts can easily use it at their own risk. Consequently, if anything goes wrong, the user is wrong, not the software. That does not sound good to me.

For (too many) years, a typical (the main?) way of running GRASS GIS in batch-execution scenarios was setting up a bunch of environmental variables. This is still possible and documented, and there, locking and cleaning are entirely optional and totally up to the user. I think --exec is a great way of doing the same thing the right way, but I don't want --exec to impose restrictions which are not present when setting up a session manually, running an interactive one, or even running one --exec job with a parallel script.
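For context, the manual environment-variable setup referred to here looks roughly like the following sketch. The GISBASE path and the database/location/mapset names are examples only, and the module call at the end is shown commented out:

```shell
# Sketch of a manually set up batch session, the style used before --exec.
# The GISBASE path and the location/mapset names below are examples only.
export GISBASE=/usr/lib/grass78
export PATH="$GISBASE/bin:$GISBASE/scripts:$PATH"
export LD_LIBRARY_PATH="$GISBASE/lib:${LD_LIBRARY_PATH:-}"
# GISRC points to a session file naming the database/location/mapset:
export GISRC="${TMPDIR:-/tmp}/grassrc_job_$$"
cat > "$GISRC" <<EOF
GISDBASE: $HOME/grassdata
LOCATION_NAME: nc_spm
MAPSET: computation
EOF
export GIS_LOCK=$$
# GRASS modules now run as normal commands, e.g.:
# r.mapcalc "result = elevation * 2"
```

Every step here is the user's responsibility, including misspelled variable names and shell-specific syntax, which is exactly the error-proneness --exec was meant to remove.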

@mlennert
Contributor

mlennert commented May 16, 2020 via email

@wenzeslaus
Member Author

@mlennert There is nothing which prevents users even now from running parallel jobs within one mapset, as long as they get a session somehow, e.g., by starting standard GRASS GIS from the command line or with --exec:

GRASS... > script_1.py map_1 &
GRASS... > script_1.py map_2 &
GRASS... > script_1.py map_3 &
grass ... --exec bash <<EOF
script_1.py map_1 &
script_1.py map_2 &
script_1.py map_3 &
EOF

You can do the same thing even with the GUI.

The --no-lock --no-clean combo just allows you to do the same when you can't have only one grass process, or when you prefer the following (over the --exec bash <<EOF trick above):

grass ... --exec script_1.py map_1 &
grass ... --exec script_1.py map_2 &
grass ... --exec script_1.py map_3 &

@metzm
Contributor

metzm commented May 18, 2020

I agree with @mlennert that we should not open additional doors for potential misuse. We do have modules that do controlled parallel processing; users can use these modules as inspiration to create their own scripts.
Arbitrary processing of different commands in the same mapset can cause difficult-to-diagnose errors. I found out that the only fail-safe command to execute in near-parallel in the same mapset is g.copy raster=..., copying different raster maps from different mapsets to the same target mapset.
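The near-parallel g.copy pattern mentioned above could look roughly like this. The mapset and map names are invented, and the commands are printed rather than executed so the sketch stands on its own:

```shell
# Hypothetical merge of per-job results into one mapset using g.copy
# (syntax: g.copy raster=from,to). Commands are echoed, not executed:
for m in process1 process2 process3; do
    echo grass ~/grassdata/nc_spm/results --exec \
        g.copy raster="result@$m,result_$m"
done
```

Each copy reads from a different source mapset and writes a differently named map in the target mapset, which is why this particular combination avoids the interference problems described above.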

@HuidaeCho
Member

HuidaeCho commented May 18, 2020

@wenzeslaus This is a very interesting discussion. I wanted to add my 2 cents here. I understand that even now we can run parallel jobs from the GUI or command line that can cause issues, but that is not the nature or a designed feature of GRASS GIS. Rather, it's a feature of the GUI or UN*X shell, and GRASS GIS just happens to run inside that environment. I personally believe that the solution to this current parallel behavior should be to put even more restrictions in place to disallow parallel outputs to the same map, if possible at all. I see this parallelism as an issue, not as a feature. From this perspective, this PR would not be ideal because it would open the door to more misuse cases and issues, like @mlennert and @metzm mentioned.

Why don't we instead implement a feature that can seamlessly merge mapsets? Then we would be able to keep the current locking paradigm, allow multiple separate processes on different mapsets, and later just merge the results into one mapset. Just an idea.

@petrasovaa
Contributor

GRASS is a powerful tool for big-data processing, and I think we should promote that. Part of that, I think, is reducing the barriers to using it on HPCs and in similar environments. Like everything, that unfortunately does open ways to 'misuse'. Since the ability to run parallel jobs in a single mapset is already there, I think it's better to expose it and properly document it rather than to go with the old way of environment variables, which seems rather error-prone and cumbersome to me.

I think completely disabling parallel processing within a mapset is not a very practical idea at this point, and we would have to compensate for that with tools which we don't have now.

I think the potential misuse could be very well addressed by documenting the cases when this is perfectly safe: for example, don't write to the db, don't modify the region, and for everything else use separate mapsets. Running parallel processes is tricky, everybody knows that, so having good documentation is important, and if a user doesn't read it, it's their problem.

Disclaimer: I am biased here, as I would like to use this workflow on HPC.

@mlennert
Contributor

Interesting discussion indeed ;-) I wonder whether some generational differences are at work here...

One of the underlying postulates that I see here is that the current functioning is a "barrier" to using GRASS in HPC and that the use of GRASS_BATCH_JOB is "old" (should I hear "to be deprecated"?).

Personally, I find GRASS in its current state very easy to use in an HPC environment, and when I say that, I mean without using --exec, which I find a very interesting addition but haven't really felt the need for up to now.

So maybe it would be helpful to get some more information about what exactly you find these barriers to be. Is it just the fact that environment variables are used? Are environment variables a thing of the past?

@petrasovaa
Contributor

Interesting discussion indeed ;-) I wonder whether some generational differences are at work here...

One of the underlying postulates that I see here is that the current functioning is a "barrier" to using GRASS in HPC and that the use of GRASS_BATCH_JOB is "old" (should I hear "to be deprecated"?).

Personally, I find GRASS in its current state very easy to use in an HPC environment, and when I say that, I mean without using --exec, which I find a very interesting addition but haven't really felt the need for up to now.

So maybe it would be helpful to get some more information about what exactly you find these barriers to be. Is it just the fact that environment variables are used? Are environment variables a thing of the past?

No, but to me they are too low-level and just less convenient; I prefer using command line parameters. Environment variables are error-prone (e.g., misspelling), and setting them depends on the particular shell. Parameters are typically well defined, and they seem like a natural way of controlling a program. I think using --exec is convenient also outside of HPC, for some external programs, for example.

@mlennert
Contributor

Interesting discussion indeed ;-) I wonder whether some generational differences are at work here...

One of the underlying postulates that I see here is that the current functioning is a "barrier" to using GRASS in HPC and that the use of GRASS_BATCH_JOB is "old" (should I hear "to be deprecated"?).

Personally, I find GRASS in its current state very easy to use in an HPC environment, and when I say that, I mean without using --exec, which I find a very interesting addition but haven't really felt the need for up to now.

So maybe it would be helpful to get some more information about what exactly you find these barriers to be. Is it just the fact that environment variables are used? Are environment variables a thing of the past?

No, but to me they are too low-level and just less convenient; I prefer using command line parameters. Environment variables are error-prone (e.g., misspelling), and setting them depends on the particular shell. Parameters are typically well defined, and they seem like a natural way of controlling a program.

Env variables are so much easier in my eyes, because you only have to set them once and not worry about them again. You have to repeat parameters every single time you call the executable... Just a question of perspective, I guess.

And if you don't want to worry about setting env variables, you can just let the grass startup script do it for you (although admittedly you do have to set GRASS_BATCH_JOB).

I think using --exec is convenient also outside of HPC, for some external programs, for example.

I actually see --exec usage essentially for third-party software, e.g., making GRASS calls easier for the QGIS processing toolbox, or possibly a workflow where you have just one or two calls to GRASS amongst many calls to other tools. As already mentioned, I haven't found the itch it would scratch for me in my work, but that's maybe mostly the stubbornness of age playing out ;-)

@wenzeslaus
Member Author

I think using --exec is convenient also outside of HPC, for some external programs, for example.

I actually see --exec usage essentially for third-party software, e.g., making GRASS calls easier for the QGIS processing toolbox, or possibly a workflow where you have just one or two calls to GRASS amongst many calls to other tools.

@mlennert As your comment suggests, there is no clear distinction between third-party software and a user's script/workflow. There is no reason why GRASS GIS power users should be barred from the opportunity to use --exec when they need more flexibility with cleaning and locking. This PR is aiming at leveling the playing field among the different interfaces and environments.

As for third-party software providing an interface to GRASS GIS, this PR is not aiming at that use case, but such software still benefits, because this allows doing operations in a mapset while managing the locking and cleaning differently; for example, there is no need to clean a temporary mapset, or the application runs a computation using one grass ... --exec and periodically checks the results using another grass ... --exec.

To give an additional perspective: to make --exec even more useful for tools which manage the GRASS database, such as the GRASS plugin for QGIS, another, more radical flag would be needed to, e.g., peek into a locked mapset. This --ignore-lock would be something in between the existing -f (force lock removal) and --no-lock (do not lock the mapset), as it would not break into the mapset like the current -f does, but it would not require all processes to agree on locking or not locking (which is what --no-lock does).

@neteler neteler added this to the 8.2.0 milestone Dec 9, 2021
@wenzeslaus wenzeslaus modified the milestones: 8.2.0, 8.4.0 Feb 27, 2022
@wenzeslaus wenzeslaus modified the milestones: 8.3.0, 8.4.0 Feb 10, 2023
@echoix echoix marked this pull request as draft March 21, 2024 22:08
@echoix
Member

echoix commented Mar 21, 2024

Is the discussion on this PR still relevant today? Have the positions changed? I converted this to a draft as it can't be updated as-is to the current code here, but I'd like to hear what you think of this today: is it something that needs to be decided on? Or should we just leave the current code as is and close the PR, as it isn't planned to integrate this?

@echoix echoix added conflicts/needs rebase Rebase to or merge with the latest base branch is needed Python Related code is in Python info needed Waiting on more info from the submitter module labels Mar 21, 2024
@wenzeslaus wenzeslaus modified the milestones: 8.4.0, Future Apr 26, 2024
7 participants