Skip to content

various problems with STDDEV() when numbers are equals #2029

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
alaingdl opened this issue Apr 23, 2025 · 6 comments
Open

various problems with STDDEV() when numbers are equals #2029

alaingdl opened this issue Apr 23, 2025 · 6 comments
Assignees

Comments

@alaingdl
Copy link
Contributor

STDDEV(array) should return 0.0 when all input are equal.

This problem is strongly linked with previous issue #2028 since we do used internal MOMENT() in stddev.pro in src/pro
but we should be able to have more simple way to minimize the problem

GDL> stddev(replicate(2.1, 1000))
   1.4312272E-05
GDL> stddev(replicate(2.1, 10000))
   0.00021124039
GDL> stddev(replicate(2.1d, 1000))
   3.8655093638078647E-14
GDL> stddev(replicate(2.1d, 10000))
   1.1991008231352767E-13

All values here should be equal to zero here !
The trick may be to test whether Min() and Max() are equal then shortcut the computation ?

What gives IDL 8.8 ?

IDL> stddev(replicate(2.1, 1000))
       0.0000000
IDL> stddev(replicate(2.1, 10000))
            -NaN
% Program caused arithmetic error: Floating illegal operand
IDL> stddev(replicate(2.1d, 1000))
       0.0000000000000000
IDL> stddev(replicate(2.1d, 10000))
       0.0000000000000000

Look like and interesting bug in IDL ?!

@GillesDuvert
Copy link
Contributor

@alaingdl I mention this for newbies, as you know this pretty well, the stddev results in gdl are 'correct' a they show the numeric accuracy of a series of 10000 (resp. 10000) sums and divide of floating point values.
This acccuracy is better with doubles

GDL> stddev(replicate(2.1d, 1000))
   3.8655093638078647E-14

To reproduce (and this is true for all other GDL commands involving floating-point adds/mults etc) exactly the IDL results, one would have not only to use the same machine-compied, optimised, algorithm (same order of adds, divide, etc at assembly level) but also in many cases the same amount of operations in each one of the parallel threads when OMP_NUM_THREADS is > 1. In other words: impossible.

However, you should be fair with GDL, you have pinpointed a case where IDL and GDL differ, but

IDL> stddev(replicate(2.1, 99766)) 
   5.2423966e-06

and not 0 !

So you are right, to get 0 in some cases IDL must have taken precautions for this special case.
I note that

DL> stddev(replicate(2.1, 9993))
   1.2930187e-06
IDL> stddev(replicate(2.1, 9994))
            -NaN

and

IDL> stddev(replicate(21., 9976688))
     0.071825057
IDL> stddev(replicate(21., 9976689))
            -NaN

So this is not a "all values are equal" test thing. The returned "NaN" (a bug indeed) would be a signature of the (apparently not completely successful!) attempt to a posteriori do something if the returned stddev is smaller than the expected accuracy (depending on number of atomic operations, single or double).

A 'simple' solution would be to do all computations in extended precision (we have no 128 bits double doubles, so this would work only for floats:)
z=stddev(replicate(2.1, 99766)) replace by

 zd=stddev(replicate(2.1d, 99766))
FLOAT_EPS=1.1920929e-07 ; see machar()
if zd le FLOAT_EPS then z=0 else z=zd
; easy but slower of course ^))

@GillesDuvert
Copy link
Contributor

@alaingdl BTW thanks for your self-assignments!

@alaingdl
Copy link
Contributor Author

Thanks for the message !

Yes I am quite aware of numerical accuracy ! (it is a pity, Coyote web site seems to be off, we have here a "the sky is falling" on that)

Yesterday I spend some time looking at the code for MOMENT() which is really impressive !

Without consideration of time, we would have a simple way to "solve" the problem, testing if the max of the array is equal to the min ! But it is quite expensive, needing to recourse the array looking for max & min. Maybe we can do that only when
the result is really close to EPS ? Or close to a given level ? In order to redo it if and only if we are in such a special case.

@alaingdl
Copy link
Contributor Author

hum hum hum, IDL just introduced RUNNING_STATS

I also look at https://www.nv5geospatialsoftware.com/docs/accuracy_and_floating_po.html
and yes, for 2.1, we are in a case where 2.1 and 2.1d are different ...
Nevertheless we have still the problems when all inputs are equal, and the values for Skewness and Kurtosis to be at NaN

@GillesDuvert
Copy link
Contributor

FYI I've done (as many others I suppose) a local working copy of coyote's site.

@GillesDuvert
Copy link
Contributor

Here may not be the good place to discuss that, but I do not think GDL should reproduce IDL's behaviour at all costs:

  • first, we use, when uncopyrighted code source is available, the more efficient code possible, that is always more recent than the one used by IDL (at least more recent than Numerical Recipes, that specialists claim to be completely obsolete). For example, the triangulation, the fft, the median filtering, the parallel mersenne twister... different code, different resulting precision.
  • second, we should use only one C++ code, the best one, for all "related" IDL commands. Example: Should RUNNING_STATS be implemented, it should share the same related code as MOMENT, MEAN, SIGMA, etc. This to get the best results, and to simplify the maintainance of our code (OK, we are not yet there). I note that the need for more statistically robust results is already covered by several procedures in IDLAstro (robust_sigma.pro etc). Hopefully people realyy keen about the accuracy would already have used them.
    (Besides, there are known means to preserve accuracy when computing basic sums, divide etc: just that it takes forever if you use them.)
  • third, IDL has undocumented features that we must support to make 'old' code run, this point is opposite to the previous 2.
  • And, the default value of !version.release returned by GDL, that modifies the behaviour of 'legacy' codes, will always be a problem.

It would be possible to add a global switch that would put GDL in an 'IDL enhanced compatibility' at the expense of a much slower GDL and a complication of the sources and their maintainance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants